How to calculate the MD5 hash of a large file in C?

I am writing in C using the OpenSSL library.
How can I calculate the MD5 hash of a large file?
As far as I know, I need to load the whole file into RAM as a char array and then call the hash function. But what if the file is about 4 GB long? Sounds like a bad idea.
SOLVED: Thanks to askovpen, I found my bug. I had used
while ((bytes = fread (data, 1, 1024, inFile)) != 0)
MD5_Update (&mdContext, data, 1024);
not
while ((bytes = fread (data, 1, 1024, inFile)) != 0)
MD5_Update (&mdContext, data, bytes);

Example:
gcc -g -Wall -o file file.c -lssl -lcrypto
#include <stdio.h>
#include <openssl/md5.h>

int main()
{
    unsigned char c[MD5_DIGEST_LENGTH];
    char *filename = "file.c";
    int i;
    FILE *inFile = fopen(filename, "rb");
    MD5_CTX mdContext;
    int bytes;
    unsigned char data[1024];

    if (inFile == NULL) {
        printf("%s can't be opened.\n", filename);
        return 0;
    }

    MD5_Init(&mdContext);
    while ((bytes = fread(data, 1, 1024, inFile)) != 0)
        MD5_Update(&mdContext, data, bytes);
    MD5_Final(c, &mdContext);

    for (i = 0; i < MD5_DIGEST_LENGTH; i++)
        printf("%02x", c[i]);
    printf(" %s\n", filename);
    fclose(inFile);
    return 0;
}
result:
$ md5sum file.c
25a904b0e512ee546b3f47574703d9fc file.c
$ ./file
25a904b0e512ee546b3f47574703d9fc file.c

First, MD5 is a hashing algorithm. It doesn't encrypt anything.
Anyway, you can read the file in chunks of whatever size you like. Call MD5_Init once, then call MD5_Update with each chunk of data you read from the file. When you're done, call MD5_Final to get the result.

You don't have to load the entire file in memory at once. You can use the functions MD5_Init(), MD5_Update() and MD5_Final() to process it in chunks to produce the hash. If you are worried about making it an "atomic" operation, it may be necessary to lock the file to prevent someone else changing it during the operation.
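For newer OpenSSL releases (the low-level MD5_* functions are deprecated since OpenSSL 3.0), the same chunked loop can be written against the EVP interface. A minimal sketch, with error handling kept deliberately short; the function name md5_file is illustrative:

#include <stdio.h>
#include <openssl/evp.h>

/* Hash a file in 4 KiB chunks with EVP; returns 1 on success, 0 on error. */
int md5_file(const char *filename,
             unsigned char digest[EVP_MAX_MD_SIZE], unsigned int *digest_len)
{
    FILE *fp = fopen(filename, "rb");
    if (!fp)
        return 0;
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    if (!ctx || !EVP_DigestInit_ex(ctx, EVP_md5(), NULL)) {
        EVP_MD_CTX_free(ctx);
        fclose(fp);
        return 0;
    }
    unsigned char buf[4096];
    size_t n;
    int ok = 1;
    while (ok && (n = fread(buf, 1, sizeof buf, fp)) > 0)
        ok = EVP_DigestUpdate(ctx, buf, n) == 1;
    ok = ok && !ferror(fp) && EVP_DigestFinal_ex(ctx, digest, digest_len) == 1;
    EVP_MD_CTX_free(ctx);
    fclose(fp);
    return ok;
}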

The top answer is correct, but one thing is worth spelling out: the value of the hash depends only on the bytes fed to MD5_Update, not on the buffer size used. Reading in chunks of 1 byte or of 64 KiB produces the same digest for the same data, provided each MD5_Update call is passed the number of bytes actually read. Hashes only differ across buffer sizes when the code passes the buffer length instead of the read count, which was exactly the asker's original bug, and it is also why online hashing tools agree with md5sum no matter how they buffer internally.
It is perfectly acceptable to use a buffer length of 1 to hash a large file; it will just take longer (duh). So pick the buffer length for performance; compatibility with other systems does not depend on it.
int hashTargetFile(FILE *fp, unsigned char **md_value, int *md_len) {
#define FILE_BUFFER_LENGTH 1
    EVP_MD_CTX *mdctx;
    const EVP_MD *md;
    unsigned int diglen;   /* digest length; EVP_DigestFinal_ex expects unsigned int */
    int arrlen = EVP_MAX_MD_SIZE + 1;
    int arrlen2 = FILE_BUFFER_LENGTH + 1;
    unsigned char *digest_value = malloc(arrlen);
    char *data = malloc(arrlen2);
    size_t bytes;          /* # of bytes read from file */

    mdctx = EVP_MD_CTX_new();
    md = EVP_sha512();
    if (!mdctx) {
        fprintf(stderr, "Error while creating digest context.\n");
        return 0;
    }
    if (!EVP_DigestInit_ex(mdctx, md, NULL)) {
        fprintf(stderr, "Error while initializing digest context.\n");
        return 0;
    }
    /* note the parentheses around the assignment: without them, bytes
       would receive the result of the != comparison, not the byte count */
    while ((bytes = fread(data, 1, FILE_BUFFER_LENGTH, fp)) != 0) {
        if (!EVP_DigestUpdate(mdctx, data, bytes)) {
            fprintf(stderr, "Error while digesting file.\n");
            return 0;
        }
    }
    if (!EVP_DigestFinal_ex(mdctx, digest_value, &diglen)) {
        fprintf(stderr, "Error while finalizing digest.\n");
        return 0;
    }
    *md_value = digest_value;
    *md_len = (int)diglen;
    EVP_MD_CTX_free(mdctx);
    free(data);
    return 1;
}
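A hypothetical caller for the function above (the file name is illustrative; note the digest buffer is malloc'd by the function, so the caller owns and frees it):

FILE *fp = fopen("bigfile.bin", "rb");
unsigned char *md = NULL;
int md_len = 0;
if (fp && hashTargetFile(fp, &md, &md_len)) {
    for (int i = 0; i < md_len; i++)
        printf("%02x", md[i]);
    printf("\n");
    free(md);
}
if (fp) fclose(fp);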

Related

Copy Function in C not creating matching Checksums

I have written a simple program that copies a file and generates an MD5. It runs and generates the MD5 correctly.
However, when verifying the file generated by the copy function, its hash does not match the source MD5. I can't see any reason for this in my code; can anyone help?
#include <stdio.h>
#include <openssl/md5.h>
#include <assert.h>

#define BUFFER_SIZE 512

int secure_copy(char *filepath, char *destpath);

int main(int argc, char *argv[]) {
    secure_copy(argv[1], argv[2]);
    return 0;
}

int secure_copy(char *filepath, char *destpath) {
    FILE *src, *dest;
    src = fopen(filepath, "r");
    assert(src != NULL);
    dest = fopen(destpath, "w");
    assert(dest != 0);

    MD5_CTX c;
    char buf[BUFFER_SIZE];
    ssize_t bytes, out_writer;
    unsigned char out[MD5_DIGEST_LENGTH];

    MD5_Init(&c);
    while ((bytes = fread(buf, 1, BUFFER_SIZE, src)) != 0) {
        MD5_Update(&c, buf, bytes);
        out_writer = fwrite(buf, 1, BUFFER_SIZE, dest);
        assert(out_writer != 0);
    }
    MD5_Final(out, &c);

    printf("MD5: ");
    for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
    {
        printf("%02x", out[i]);
    }
    printf("\n");
    fclose(src);
    fclose(dest);
    return 0;
}
Output
$ ./md5speed doc.txt /home/doc.txt
MD5: 4c55e4b9185eece3cc000c4023f8f6fe
When verifying the copied file with md5sum, I get a completely different hash.
md5sum doc.txt
29cb4da30c3e28fdb81463b5f0a76894 doc.txt
The copied file still opens, though, and its content appears uncorrupted.
Regarding:
while((bytes = fread(buf, 1, BUFFER_SIZE, src)) != 0)
and
out_writer = fwrite(buf, 1, BUFFER_SIZE, dest);
on the last read, the amount read can be less than BUFFER_SIZE, so the code should always use the bytes variable for the number of bytes to write.
Also, errors can occur when calling fread() and/or fwrite(). Both return the number of items actually processed, which on error (or at EOF) is less than the count requested. To be robust, the code must check those return values and use ferror()/feof() to distinguish an error from end of file.
As stated in the comments, changing the fwrite call to use bytes instead of BUFFER_SIZE, combined with opening both files in binary mode ("rb" and "wb"), fixed the problem.
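A sketch of the corrected loop, as a drop-in replacement inside secure_copy() above (bytes, buf, c, src and dest are the existing variables; the error handling is indicative only):

while ((bytes = fread(buf, 1, BUFFER_SIZE, src)) > 0) {
    MD5_Update(&c, buf, bytes);
    /* write only the bytes actually read, not the full buffer */
    if (fwrite(buf, 1, bytes, dest) != (size_t)bytes) {
        perror("fwrite");
        break;
    }
}
if (ferror(src))
    perror("fread");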

C in UNIX: Reading/combining files based upon number of bytes

I am trying to fix the code below to read only the first N bytes of a file. I would also like to do the same for the last N bytes (I assume that would involve just adding a '-' in front of the number of bytes N). I am not sure if fgets is the correct method for doing so.
I tried changing the 1000 in
while (fgets(buffer, 1000, fp))
however I do not think changing that value will read an exact number of bytes, as I have read that it is only a maximum.
char buffer[1001];

int main(int argc, char **argv) {
    bzero(buffer, sizeof(buffer));
    for (int x = 1; x < argc; x++) {
        FILE *fp = fopen(argv[x], "r+");
        if (fp) {
            while (fgets(buffer, 1000, fp)) {
                printf("%s", buffer);
            }
        } else {
            printf("could not open file %s\n", argv[x]);
        }
    }
}
Assuming that you want the first 1000 bytes and the last 1000 bytes of a file, and largely ignoring problems with files smaller than 2000 bytes (it works, but you might want a different result), you could use:
#include <stdio.h>

enum { NUM_BYTES = 1000 };

int main(int argc, char **argv)
{
    for (int x = 1; x < argc; x++)
    {
        FILE *fp = fopen(argv[x], "r");
        if (fp)
        {
            char buffer[NUM_BYTES];
            int nbytes = fread(buffer, 1, NUM_BYTES, fp);
            fwrite(buffer, 1, nbytes, stdout);
            if (fseek(fp, -NUM_BYTES, SEEK_END) == 0)
            {
                nbytes = fread(buffer, 1, NUM_BYTES, fp);
                fwrite(buffer, 1, nbytes, stdout);
            }
            fclose(fp);
        }
        else
        {
            fprintf(stderr, "%s: could not open file %s\n", argv[0], argv[x]);
        }
    }
}
This uses fread(), fwrite() and fseek() as suggested in the comments.
It also takes care to close successfully opened files. It does not demand write permissions on the files since it only reads and does not write those files (using "r" instead of "r+" in the call to fopen()).
If the file is smaller than 1000 bytes, the fseek() will fail because it tries to seek to a negative offset. If that happens, don't bother to read or write another 1000 bytes.
I debated whether to use sizeof(buffer) or NUM_BYTES in the function calls. I decided that NUM_BYTES was better, but the choice is not definitive; there are cogent arguments for using sizeof(buffer) instead.
Note that buffer becomes a local variable. There's no need to zero it; only the bytes that fread() filled in are written by fwrite(), so bzero() solves no problem. (There was doubly no point in it when the variable was global; variables with static duration are initialized to all-bytes-zero by default anyway.)
The error message is written to standard error.
The code doesn't check for zero bytes read; arguably, it should.
If the NUM_BYTES becomes a parameter (e.g. you call your program fl19 and use fl19 -n 200 file1 to print the first and last 200 bytes of file1), then you need to do some tidying up as well as command-line argument handling.
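One possible shape for that tidying, using POSIX getopt(); the program name fl19 and the -n option follow the example above, and the per-file body is elided:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    long num_bytes = 1000;   /* default, matching the fixed-size version */
    int opt;
    while ((opt = getopt(argc, argv, "n:")) != -1) {
        if (opt == 'n')
            num_bytes = strtol(optarg, NULL, 10);
        else {
            fprintf(stderr, "usage: %s [-n bytes] file ...\n", argv[0]);
            return EXIT_FAILURE;
        }
    }
    for (int x = optind; x < argc; x++) {
        /* open argv[x]; print its first and last num_bytes bytes,
           as in the loop above, using a malloc'd buffer of num_bytes */
    }
    return 0;
}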

How to parse files that cannot fit entirely in RAM

I have created a framework to parse text files of reasonable size that can fit in RAM, and for now things are going well. I have no complaints; however, what if I encounter a situation where I have to deal with large files, say, greater than 8 GB (which is the size of mine)?
What would be an efficient approach to deal with such large files?
My framework:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <limits.h>   /* for INT_MAX, used in dump_file_to_array() */

int Parse(const char *filename, const char *outputfile);

int main(void)
{
    clock_t t1 = clock();
    /* ............................................. */
    Parse("file.txt", NULL);
    /* ............................................. */
    clock_t t2 = clock();
    fprintf(stderr, "time elapsed: %.4f\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    fprintf(stderr, "Press any key to continue . . . ");
    getchar();
    return 0;
}

long GetFileSize(FILE *fp)
{
    long f_size;
    fseek(fp, 0L, SEEK_END);
    f_size = ftell(fp);
    fseek(fp, 0L, SEEK_SET);
    return f_size;
}

char *dump_file_to_array(FILE *fp, size_t f_size)
{
    char *buf = (char *)calloc(f_size + 1, 1);
    if (buf) {
        size_t n = 0;
        while (fgets(buf + n, INT_MAX, fp)) {
            n += strlen(buf + n);
        }
    }
    return buf;
}
int Parse(const char *filename, const char *outputfile)
{
    /* open file for reading in text mode */
    FILE *fp = fopen(filename, "r");
    if (!fp) {
        perror(filename);
        return 1;
    }

    /* store file in dynamic memory and close file */
    size_t f_size = GetFileSize(fp);
    char *buf = dump_file_to_array(fp, f_size);
    fclose(fp);
    if (!buf) {
        fputs("error: memory allocation failed.\n", stderr);
        return 2;
    }

    /* state machine variables */
    // ........

    /* array index variables */
    size_t x = 0;
    size_t y = 0;

    /* main loop */
    while (buf[x]) {
        switch (buf[x]) {
        /* ... */
        }
        x++;
    }

    /* NUL-terminate array at y */
    buf[y] = '\0';

    /* write buffer to file and clean up */
    outputfile ? fp = fopen(outputfile, "w") :
                 fp = fopen(filename, "w");
    if (!fp) {
        outputfile ? perror(outputfile) : perror(filename);
    }
    else {
        fputs(buf, fp);
        fclose(fp);
    }
    free(buf);
    return 0;
}
Pattern deletion function based on the framework:
int delete_pattern_in_file(const char *filename,
                           const char *pattern, const char *outputfile)
{
    /* open file for reading in text mode */
    FILE *fp = fopen(filename, "r");
    if (!fp) {
        perror(filename);
        return 1;
    }

    /* copy file contents to buffer and close file */
    size_t f_size = GetFileSize(fp);
    char *buf = dump_file_to_array(fp, f_size);
    fclose(fp);
    if (!buf) {
        fputs("error - memory allocation failed", stderr);
        return 2;
    }

    /* delete first match */
    size_t n = 0, pattern_len = strlen(pattern);
    char *tmp, *ptr = strstr(buf, pattern);
    if (!ptr) {
        fputs("No match found.\n", stderr);
        free(buf);
        return -1;
    }
    else {
        n = ptr - buf;
        ptr += pattern_len;
        tmp = ptr;
    }

    /* delete the rest */
    while (ptr = strstr(ptr, pattern)) {
        while (tmp < ptr) {
            buf[n++] = *tmp++;
        }
        ptr += pattern_len;
        tmp = ptr;
    }

    /* copy the rest of the buffer */
    strcpy(buf + n, tmp);

    /* open file for writing and print the processed buffer to it */
    outputfile ? fp = fopen(outputfile, "w") :
                 fp = fopen(filename, "w");
    if (!fp) {
        outputfile ? perror(outputfile) : perror(filename);
    }
    else {
        fputs(buf, fp);
        fclose(fp);
    }
    free(buf);
    return 0;
}
If you wish to stick with your current design, an option might be to mmap() the file instead of reading it into a memory buffer.
You could change the function dump_file_to_array to the following (linux-specific):
#include <sys/mman.h>   /* for mmap() */

char *dump_file_to_array(FILE *fp, size_t f_size) {
    char *buf = mmap(NULL, f_size, PROT_READ, MAP_SHARED, fileno(fp), 0);
    if (buf == MAP_FAILED)
        return NULL;
    return buf;
}
Now you can read through the file; the memory manager will automatically take care of holding only the relevant portions of the file in memory.
For Windows, similar mechanisms exist.
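Two caveats worth adding. The mapping above is PROT_READ, so the in-place writes in the original Parse() (e.g. buf[y] = '\0') would fault; mapping with PROT_READ | PROT_WRITE and MAP_PRIVATE keeps such writes in a copy-on-write copy without touching the file. And the mapping should eventually be released; a minimal cleanup sketch matching the call above:

#include <sys/mman.h>

/* ... after parsing is finished ... */
if (munmap(buf, f_size) == -1)
    perror("munmap");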
Chances are you are parsing the file line by line. So read in a large block (4 KiB or 16 KiB) and parse all the complete lines in it. Copy the small remainder to the beginning of the buffer and fill the rest of the buffer from the file. Rinse and repeat.
For JSON or XML you will need an event-based parser that can accept input in multiple blocks.
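A rough shape for that loop, assuming newline-terminated records; the 16 KiB buffer size and the parse_line() handler are illustrative, not part of the original framework:

#include <stdio.h>
#include <string.h>

void parse_stream(FILE *fp)
{
    char buf[16384];
    size_t fill = 0;   /* bytes currently held in buf */
    size_t n;
    while ((n = fread(buf + fill, 1, sizeof(buf) - fill, fp)) > 0) {
        fill += n;
        char *start = buf;
        char *nl;
        while ((nl = memchr(start, '\n', fill - (start - buf))) != NULL) {
            *nl = '\0';
            /* parse_line(start);  -- hypothetical per-line handler */
            start = nl + 1;
        }
        fill -= start - buf;        /* keep the partial tail ...      */
        memmove(buf, start, fill);  /* ... at the front of the buffer */
        /* a single record longer than sizeof(buf) needs extra handling */
    }
    /* any remaining fill bytes form a final line without a '\n' */
}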
There are multiple issues with your approach.
The concepts of maximum and available memory are not so evident: technically, you are not limited by the RAM size, but by the quantity of memory your environment will let your program allocate and use. This depends on various factors:
What ABI you compile for: the maximum memory size accessible to your program is limited to less than 4 GB if you compile for 32-bit code, even if your system has more RAM than that.
What quota the system is configured to let your program use. This may be less than available memory.
What strategy the system uses when more memory is requested than is physically available: most modern systems use virtual memory and share physical memory between processes and system tasks (such as the disk cache) using very advanced algorithms that cannot be described in a few lines. It is possible on some systems for your program to allocate and use more memory than is physically installed on the motherboard, swapping memory pages to disk as more memory is accessed, at a huge cost in lag time.
There are further issues in your code:
The type long might be too small to hold the size of the file: on Windows systems, long is 32-bit even on 64-bit versions where memory can be allocated in chunks larger than 2 GB. You must use a different API to request the file size from the system (see the stat() sketch after this list).
You read the file with a series of calls to fgets(). This is inefficient; a single call to fread() would suffice. Furthermore, if the file contains embedded null bytes ('\0' characters), chunks of the file will be missing in memory. In any case, you could not deal with embedded null bytes if you use string functions such as strstr() and strcpy() to handle your string deletion task.
The condition in while (ptr = strstr(ptr, pattern)) is an assignment. While not strictly incorrect, it is poor style: it confuses readers of your code and prevents life-saving warnings from the compiler in the cases where such assignment-conditions are coding errors. You might think that could never happen, but anyone can make a typo, and a missing = in a test is difficult to spot and has dire consequences.
Your shorthand use of the ternary operator in place of if statements is quite confusing too: outputfile ? fp = fopen(outputfile, "w") : fp = fopen(filename, "w");
rewriting the input file in place is risky too: if anything goes wrong, the input file will be lost.
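For the file-size point in the list above, a minimal sketch using POSIX stat(), whose off_t st_size is 64-bit on modern systems (on 32-bit Linux, compile with -D_FILE_OFFSET_BITS=64; on Windows, GetFileSizeEx() plays the same role):

#include <sys/stat.h>

/* Returns the file size in bytes, or -1 on error. */
long long portable_file_size(const char *filename)
{
    struct stat st;
    if (stat(filename, &st) == -1)
        return -1;
    return (long long)st.st_size;
}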
Note that you can implement the filtering on the fly, without a buffer, albeit inefficiently:
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: delpat PATTERN < inputfile > outputfile\n");
        return 1;
    }
    unsigned char *pattern = (unsigned char *)argv[1];
    size_t i, j, n = strlen(argv[1]);
    size_t skip[n + 1];
    int c;

    skip[0] = 0;
    for (i = j = 1; i < n; i++) {
        while (memcmp(pattern, pattern + j, i - j)) {
            j++;
        }
        skip[i] = j;
    }
    i = 0;
    while ((c = getchar()) != EOF) {
        for (;;) {
            if (i < n && c == pattern[i]) {
                if (++i == n) {
                    i = 0; /* match found, consumed */
                }
                break;
            }
            if (i == 0) {
                putchar(c);
                break;
            }
            for (j = 0; j < skip[i]; j++) {
                putchar(pattern[j]);
            }
            i -= skip[i];
        }
    }
    for (j = 0; j < i; j++) {
        putchar(pattern[j]);
    }
    return 0;
}
First of all, I wouldn't suggest holding such big files in RAM; use streams instead. This is because buffering is usually done by the library as well as by the kernel.
If you are accessing the file sequentially, which seems to be the case, then you probably know that all modern systems implement read-ahead algorithms, so just reading the whole file ahead of time into RAM will in most cases just waste time.
You didn't specify the use cases you have to cover, so I'll have to assume that using streams (such as std::ifstream, if you are in C++) and parsing on the fly will suit your needs. As a side note, make sure your operations on files that are expected to be large are done in a separate thread.
An alternative solution: if you're on a Linux system and you have a decent amount of swap space, just open the whole bad boy up. It will consume your RAM and also consume hard drive space (swap). Thus you can have the entire thing open at once, just not all of it will be in RAM.
Pros
If an unexpected shut down occurred, the memory on the swap space is recoverable.
RAM is expensive, HDDs are cheap, so the application would put less strain on your expensive equipment
Virus could not harm your computer because there would be no room in RAM for them to run
You'll be taking full advantage of the Linux operating system by using the swap space. Normally the swap space module is not used and all it does is clog up precious ram.
The additional energy that is needed to utilize the entirety of the ram can warm the immediate area. Useful during winter time
You can add "Complex and Special Memory Allocation Engineering" to your resume.
Cons
None
Consider treating the file as an external array of lines.
Code can use an array of line indexes. This index array can be kept in memory at a fraction of the size of the large file. Access to any line is accomplished quickly via this lookup, a seek with fsetpos(), and an fread()/fgets(). As lines are edited, the new lines can be saved, in any order, in a temporary text file. Saving the file reads both the original file and the temp one in sequence to form and write the new file.
typedef struct {
    int attributes;        // not_yet_read, line_offset/length_determined,
                           // line_changed/in_other_file, deleted, etc.
    fpos_t line_offset;    // use with fgetpos() fsetpos()
    unsigned line_length;  // optional field as code could re-compute as needed.
} line_index;

size_t line_count;
// read some lines
line_index *index = malloc(sizeof *index * line_count);
// read more lines
index = realloc(index, sizeof *index * line_count);
// edit lines, save changes to appended temporary file.
// ...
// Save file - weave the contents of the source file and temp file to the new output file.
Additionally, with enormous files, the line_index[] array itself can be kept on disk too. Access to it is easily computed. In the extreme, only one line of the file needs to be in memory at any time.
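A hypothetical accessor built on that index; fetch_line() is illustrative and the field names follow the struct sketch above:

/* Seek to a previously indexed line and read it into caller storage. */
int fetch_line(FILE *fp, const line_index *idx, char *dst, size_t dst_size)
{
    if (fsetpos(fp, &idx->line_offset) != 0)
        return -1;                  /* seek failed */
    if (fgets(dst, (int)dst_size, fp) == NULL)
        return -1;                  /* read failed or EOF */
    return 0;
}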
You mentioned a state machine. Every finite state automaton can be optimized to have minimal (or no) lookahead.
Is it possible to do this in Lex? It will generate an output C file which you can compile.
If you don't want to use Lex, you can always do the following (a literal sketch in code appears after these steps):
Read n chars into a (ring?) buffer, where n is the size of the pattern.
Try to match the buffer against the pattern.
If it matches, go to 1.
Print buffer[0], read one char, go to 2.
Also, for very long patterns and degenerate inputs, strstr() can be slow. In that case you might want to look into more advanced string matching algorithms.
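A literal sketch of the four steps, using a simple shift buffer rather than a true ring buffer; it deletes every occurrence of PATTERN from stdin, and sliding one byte at a time keeps overlapping candidates in play:

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s PATTERN < in > out\n", argv[0]);
        return 1;
    }
    size_t n = strlen(argv[1]);
    char buf[256];
    if (n == 0 || n > sizeof(buf))
        return 1;
    size_t fill = 0;   /* step 1: accumulate n chars in the buffer */
    int c;
    while ((c = getchar()) != EOF) {
        buf[fill++] = (char)c;
        if (fill == n) {                        /* step 2: try to match */
            if (memcmp(buf, argv[1], n) == 0) {
                fill = 0;                       /* step 3: match, consume */
            } else {
                putchar(buf[0]);                /* step 4: emit one byte */
                memmove(buf, buf + 1, --fill);  /* slide the window */
            }
        }
    }
    fwrite(buf, 1, fill, stdout);               /* flush the partial tail */
    return 0;
}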
mmap() is a pretty good way of working with large files.
It provides a lot of flexibility, but you need to be cautious with the page size; good articles on mmap() internals cover the specifics.

C, Segmentation fault parsing large csv file

I wrote a simple program that would open a csv file, read it, make a new csv file, and only write some of the columns (I don't want all of the columns and am hoping removing some will make the file more manageable). The file is 1.15 GB, but fopen() doesn't have a problem with it. The segmentation fault happens in my while loop shortly after the first progress printf().
I tested on just the first few lines of the csv, and the logic below does what I want. The strange section for when index == 0 is due to the last column being in the form (xxx, yyy)\n (a , inside a field of a comma separated value file is just ridiculous).
Here is the code, the while loop is the problem:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    long size;
    FILE *inF = fopen("allCrimes.csv", "rb");
    if (!inF) {
        puts("fopen() error");
        return 0;
    }
    fseek(inF, 0, SEEK_END);
    size = ftell(inF);
    rewind(inF);
    printf("In file size = %ld bytes.\n", size);

    char *buf = malloc((size+1)*sizeof(char));
    if (fread(buf, 1, size, inF) != size) {
        puts("fread() error");
        return 0;
    }
    fclose(inF);
    buf[size] = '\0';

    FILE *outF = fopen("lessColumns.csv", "w");
    if (!outF) {
        puts("fopen() error");
        return 0;
    }

    int index = 0;
    char *currComma = strchr(buf, ',');
    fwrite(buf, 1, (int)(currComma-buf), outF);
    int progress = 0;
    while (currComma != NULL) {
        index++;
        index = (index%14 == 0) ? 0 : index;
        progress++;
        if (progress%1000 == 0) printf("%d\n", progress/1000);
        int start = (int)(currComma-buf);
        currComma = strchr(currComma+1, ',');
        if (!currComma) break;
        if ((index >= 3 && index <= 10) || index == 13) continue;
        int end = (int)(currComma-buf);
        int endMinusStart = end-start;
        char *newEntry = malloc((endMinusStart+1)*sizeof(char));
        strncpy(newEntry, buf+start, endMinusStart);
        newEntry[end+1] = '\0';
        if (index == 0) {
            char *findNewLine = strchr(newEntry, '\n');
            int newLinePos = (int)(findNewLine-newEntry);
            char *modifiedNewEntry = malloc((strlen(newEntry)-newLinePos+1)*sizeof(char));
            strcpy(modifiedNewEntry, newEntry+newLinePos);
            fwrite(modifiedNewEntry, 1, strlen(modifiedNewEntry), outF);
        }
        else fwrite(newEntry, 1, end-start, outF);
    }
    fclose(outF);
    return 0;
}
Edit: It turned out the problem was that the csv file had commas in places I was not expecting, which caused the logic to fail. I ended up writing a new parser that removes lines with the incorrect number of commas. It removed 243,875 lines (about 4% of the file). I'll post that code instead, as it at least reflects some of the comments about free():
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    long size;
    FILE *inF = fopen("allCrimes.csv", "rb");
    if (!inF) {
        puts("fopen() error");
        return 0;
    }
    fseek(inF, 0, SEEK_END);
    size = ftell(inF);
    rewind(inF);
    printf("In file size = %ld bytes.\n", size);

    char *buf = malloc((size+1)*sizeof(char));
    if (fread(buf, 1, size, inF) != size) {
        puts("fread() error");
        return 0;
    }
    fclose(inF);
    buf[size] = '\0';

    FILE *outF = fopen("uniformCommaCount.csv", "w");
    if (!outF) {
        puts("fopen() error");
        return 0;
    }

    int numOmitted = 0;
    int start = 0;
    while (1) {
        char *currNewLine = strchr(buf+start, '\n');
        if (!currNewLine) {
            puts("Done");
            break;
        }
        int end = (int)(currNewLine-buf);
        char *entry = malloc((end-start+2)*sizeof(char));
        strncpy(entry, buf+start, end-start+1);
        entry[end-start+1] = '\0';

        int commaCount = 0;
        char *commaPointer = entry;
        for (; *commaPointer; commaPointer++)
            if (*commaPointer == ',') commaCount++;

        if (commaCount == 14) fwrite(entry, 1, end-start+1, outF);
        else numOmitted++;

        free(entry);
        start = end+1;
    }
    fclose(outF);
    printf("Omitted %d lines\n", numOmitted);
    return 0;
}
You're malloc'ing but never freeing. Possibly you run out of memory, one of your mallocs returns NULL, and the subsequent call to str(n)cpy segfaults.
Adding free(newEntry); and free(modifiedNewEntry); immediately after the respective fwrite calls should solve your memory shortage.
Also note that inside your loop you compute offsets into the buffer buf, which contains the whole file. These offsets are held in variables of type int, whose maximum value on your system may be too small for the numbers you are handling. Also note that adding large ints may result in a negative value, which is another possible cause of the segfault (negative offsets into buf take you to some address outside the buffer, possibly not even readable).
The malloc(3) function can (and sometimes does) fail.
At least code something like
char *buf = malloc(size + 1);
if (!buf) {
    /* needs <errno.h> and <string.h>; size is a long, hence %ld */
    fprintf(stderr, "failed to malloc %ld bytes - %s\n",
            size + 1, strerror(errno));
    exit(EXIT_FAILURE);
}
And I strongly suggest clearing the successful result of malloc with memset(buf, 0, size+1) (or otherwise using calloc), not only because the following fread could fail (which you are testing for), but to ease debugging and reproducibility.
And likewise for every other call to malloc or calloc (you should always test them for failure).
Notice that by definition sizeof(char) is always 1. Hence I removed it.
As others pointed out, you have a memory leak because you don't call free appropriately. A tool like valgrind could help.
You need to learn how to use the debugger (e.g. gdb). Don't forget to compile with all warnings and debugging information (e.g. gcc -Wall -g). And improve your code till you get no warnings.
Knowing how to use a debugger is an essential required skill when programming (particularly in C or C++). That debugging skill (and ability to use the debugger) will be useful in every C or C++ program you contribute to.
BTW, you could read your file line by line with getline(3) (which can also fail and you should test that).
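A minimal sketch of that approach, assuming a POSIX system; the input file name is illustrative and the per-line work is elided:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(void)
{
    FILE *fp = fopen("allCrimes.csv", "r");
    if (!fp) {
        perror("allCrimes.csv");
        return EXIT_FAILURE;
    }
    char *line = NULL;
    size_t cap = 0;
    ssize_t len;
    while ((len = getline(&line, &cap, fp)) != -1) {
        /* process one line of len bytes here */
    }
    free(line);   /* getline's buffer is freed once, at the end */
    fclose(fp);
    return 0;
}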

copying contents of a text file in c

I want to read a text file and transfer its contents to another text file in C. Here is my code:
#include <stdio.h>

int main(void)
{
    char buffer[100];
    FILE *rfile = fopen("myfile.txt", "r+");
    if (rfile == NULL)
    {
        printf("couldn't open File...\n");
    }
    fseek(rfile, 0, SEEK_END);
    size_t file_size = ftell(rfile);
    printf("%d\n", file_size);
    fseek(rfile, 0, SEEK_SET);
    fread(buffer, file_size, 1, rfile);

    FILE *pFile = fopen("newfile.txt", "w+");
    fwrite(buffer, 1, sizeof(buffer), pFile);
    fclose(rfile);
    fclose(pFile);
    return 0;
}
The problem I am facing is the appearance of unnecessary data in the receiving file.
I tried the fwrite call with both sizeof(buffer) and file_size: in the first case it produces a greater number of useless characters, while in the second case the number of useless characters is only 3. I would really appreciate it if someone pointed out my mistake and told me how to get rid of these useless characters.
You are writing the entire content of buffer (100 chars) into the receiving file. You need to write exactly the amount of data read:
fwrite(buffer, 1, file_size, pFile)
Adding more checks to your code:

#include <stdio.h>
#include <stdlib.h>

#define BUFFER_SIZE 100

int main(void) {
    char buffer[BUFFER_SIZE];
    size_t file_size;
    size_t ret;

    FILE *rfile = fopen("input.txt", "r+");
    if (rfile == NULL)
    {
        printf("couldn't open File \n");
        return 0;
    }
    fseek(rfile, 0, SEEK_END);
    file_size = ftell(rfile);
    fseek(rfile, 0, SEEK_SET);
    printf("File size: %zu\n", file_size);
    if (!file_size) {
        printf("Warning! Empty input file!\n");
    } else if (file_size >= BUFFER_SIZE) {
        printf("Warning! File size greater than %d. File will be truncated!\n", BUFFER_SIZE);
        file_size = BUFFER_SIZE;
    }
    ret = fread(buffer, sizeof(char), file_size, rfile);
    if (file_size != ret) {
        printf("I/O error\n");
    } else {
        FILE *pFile = fopen("newfile.txt", "w+");
        if (!pFile) {
            printf("Can not create the destination file\n");
        } else {
            ret = fwrite(buffer, 1, file_size, pFile);
            if (ret != file_size) {
                printf("Writing error!");
            }
            fclose(pFile);
        }
    }
    fclose(rfile);
    return 0;
}
You need to check the return values from all calls to fseek(), fread() and fwrite(), even fclose().
In your example, you have fread() read 1 block which is 100 bytes long. It's often a better idea to reverse the parameters, like this: ret = fread(buffer,1,file_size,rfile). The ret value will then show how many bytes it could read, instead of just saying it could not read a full block.
Here is an implementation of an (almost) general purpose file copy function:
void fcopy(FILE *f_src, FILE *f_dst)
{
    char buffer[BUFSIZ];
    size_t n;

    while ((n = fread(buffer, sizeof(char), sizeof(buffer), f_src)) > 0)
    {
        if (fwrite(buffer, sizeof(char), n, f_dst) != n)
            err_syserr("write failed\n");
    }
}
Given an open file stream f_src to read and another open file stream f_dst to write, it copies (the remainder of) the file associated with f_src to the file associated with f_dst. It does so moderately economically, using the buffer size BUFSIZ from <stdio.h>. Often, you will find that bigger buffers (such as 4 KiB or 4096 bytes, even 64 KiB or 65536 bytes) will give better performance. Going larger than 64 KiB seldom yields much benefit, but YMMV.
The code above calls an error reporting function (err_syserr()) which is assumed not to return. That's why I designated it 'almost general purpose'. The function could be upgraded to return an int value, 0 on success and EOF on a failure:
enum { BUFFER_SIZE = 4096 };

int fcopy(FILE *f_src, FILE *f_dst)
{
    char buffer[BUFFER_SIZE];
    size_t n;

    while ((n = fread(buffer, sizeof(char), sizeof(buffer), f_src)) > 0)
    {
        if (fwrite(buffer, sizeof(char), n, f_dst) != n)
            return EOF; // Optionally report write failure
    }
    if (ferror(f_src) || ferror(f_dst))
        return EOF; // Optionally report I/O error detected
    return 0;
}
Note that this design doesn't open or close files; it works with open file streams. You can write a wrapper that opens the files and calls the copy function (or includes the copy code into the function). Also note that to change the buffer size, I simply changed the buffer definition; I didn't change the main copy code. Also note that any 'function call overhead' in calling this little function is completely swamped by the overhead of the I/O operations themselves.
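A hypothetical wrapper along those lines; the name copy_file is illustrative and it relies on the int-returning fcopy() variant above:

#include <stdio.h>

int copy_file(const char *src_name, const char *dst_name)
{
    FILE *src = fopen(src_name, "rb");
    if (!src) {
        perror(src_name);
        return -1;
    }
    FILE *dst = fopen(dst_name, "wb");
    if (!dst) {
        perror(dst_name);
        fclose(src);
        return -1;
    }
    int rc = fcopy(src, dst);
    if (fclose(dst) == EOF)   /* buffered write errors can surface here */
        rc = EOF;
    fclose(src);
    return rc == 0 ? 0 : -1;
}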
Note that ftell returns a long, not a size_t; that shouldn't matter here. Be aware, though, that the value ftell returns for a text stream is not necessarily a byte offset: the standard only requires it to be an acceptable argument to fseek. You might get a better result from fgetpos, but it has the same portability issue arising from the lack of specification by the standard. (Confession: I didn't check the standard itself; got all this from the man pages.)
The more robust way to get a file size is with fstat():

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

struct stat stat_buf;
if (fstat(fileno(rfile), &stat_buf) == -1)   /* fstat() takes a file descriptor */
    perror("fstat"), exit(EXIT_FAILURE);
file_size = stat_buf.st_size;
I think the parameters you passed to fwrite are not in the right sequence.
To me it should be like this:
fwrite(buffer,SIZE,1,pFile)
as the syntax of fwrite is
size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);
The function fwrite() writes nmemb elements of data, each size bytes long, to the stream pointed to by stream, obtaining them from the location given by ptr.
So change the sequence and try again.
