Printing file content, help for understanding different outputs [duplicate] - c

I would like to copy binary file source to file target. Nothing more! The code is inspired from many examples found on the Internet.
#include <stdio.h>

int main(int argc, char **argv) {
    FILE *fp1, *fp2;
    char ch;

    fp1 = fopen("source.pdf", "r");
    fp2 = fopen("target.pdf", "w");
    while ((ch = fgetc(fp1)) != EOF)
        fputc(ch, fp2);
    fclose(fp1);
    fclose(fp2);
    return 0;
}
The result differs in file size.
root@vm:/home/coder/test# ls -l
-rwxr-x--- 1 root root 14593 Feb 28 10:24 source.pdf
-rw-r--r-- 1 root root 159 Mar 1 20:19 target.pdf
Ok, so what's the problem?
I know that a plain char can be signed, so byte values of 0x80 and above come out negative. See here.
This is confirmed when I use printf("%x\n", ch);, which roughly half the time prints something like FFFFFFE1.
The solution to my issue would be to use int instead of char.
Examples found with char: example 1, example 2
example 3, example 4, ...
Examples found with int: example a, ...
I don't use fancy compiler options.
Why do virtually all of the code examples I found assign the return value of fgetc() to a char instead of an int, which would be more correct?
What am I missing?

ISO C mandates that fgetc() returns an int since it must be able to return every possible character in addition to an end-of-file indicator.
So code that places the return value into a char, and uses it to detect EOF, is generally plain wrong and should not be used.
Having said that, two of the examples you gave don't actually do that.
One of them uses fseek and ftell to get the number of bytes in the file and then uses that to control the read/write loop. That could be problematic, since the file can change in size after the size is retrieved, but that's a different problem from trying to force an int into a char.
The other uses feof immediately after the character is read to check if the end of file has been reached.
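For reference, a sketch of that feof()-after-read pattern (the stream names in and out are placeholders; note that a read error would also need an ferror() check to avoid looping forever):
int ch;
for (;;) {
    ch = fgetc(in);
    if (feof(in) || ferror(in))  /* check the stream's state flags, not the returned value */
        break;
    fputc(ch, out);
}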
But you're correct in that the easiest way to do it is to simply use the return value correctly, something like:
int charInt;
while ((charInt = fgetc(inputHandle)) != EOF)
    doSomethingWith(charInt);
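Put into the context of the question's copy program, a minimal sketch (binary modes added, since a PDF is being copied; file names taken from the question):
#include <stdio.h>

int main(void)
{
    FILE *fp1 = fopen("source.pdf", "rb");  /* "rb"/"wb": binary mode matters on some platforms */
    FILE *fp2 = fopen("target.pdf", "wb");
    int ch;                                 /* int keeps EOF distinguishable from every byte */

    if (fp1 == NULL || fp2 == NULL)
        return 1;

    while ((ch = fgetc(fp1)) != EOF)
        fputc(ch, fp2);

    fclose(fp1);
    fclose(fp2);
    return 0;
}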

Well, the thing is, most of the code you saw is wrong. There are three types of char: signed, unsigned, and plain char. Now, if plain char is signed by default, then a byte with the decimal value 255 will compare equal to -1 (EOF). This is not what you want. (Yes, 255 is not representable in a signed char; the conversion is implementation-defined, but on most implementations it stores the bit pattern 0xFF in the char.)
Secondly, if char is unsigned, then storing EOF in it gives 0xFF, which is also wrong: the comparison with EOF can never succeed, so the loop never terminates. (EOF is -1; converted to an unsigned char it becomes CHAR_MAX, which is 255, i.e. 0xFF, and that promotes back to 255, not -1.)
That's why int is used: it can hold the value of EOF correctly, and that is how you should use it.
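A small demonstration of both failure modes (it assumes 8-bit char and EOF == -1, which is typical but implementation-defined):
#include <stdio.h>

int main(void)
{
    signed char s = EOF;    /* stores -1: now indistinguishable from the byte 0xFF */
    unsigned char u = EOF;  /* -1 converted to unsigned char: becomes 255          */

    printf("signed char:   compares equal to EOF? %s\n", s == EOF ? "yes" : "no"); /* yes */
    printf("unsigned char: compares equal to EOF? %s\n", u == EOF ? "yes" : "no"); /* no  */
    return 0;
}
The signed case ends a read loop early on the first 0xFF byte; the unsigned case never ends it at all.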

Related

Segmentation Fault 11 when trying to read an image byte per byte

I'm trying to write a simple C program that counts how many times each byte is repeated in a file. We tried the code with .txt files and it works wonders (max size tested: 137 MB). But when we tried it with an image (even a small one, 2 KB) it returned Segmentation fault 11.
I've done some research and found some libraries specific to images, but I don't want to resort to them, since the code is not meant only for images but for virtually any type of file. Is there a way to simply read a file byte by byte, regardless of anything else (extension, metadata, etc.)?
This is the code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    FILE *f;
    char *file;
    long numTotalBytes = 0;
    int bytesCount[256] = {0};

    f = fopen(argv[1], "rb");
    fseek(f, 0L, SEEK_END);
    numTotalBytes = ftell(f);
    rewind(f);

    file = calloc(1, numTotalBytes);
    fread(file, numTotalBytes, 1, f);
    fclose(f);

    printf("numTotalBytes: %ld", numTotalBytes); //<- this gives the right output even for images

    unsigned int i;
    for (i = 0; i < numTotalBytes; ++i) {
        unsigned char pointer = file[i]; //<- This access fails at file[1099]
        int pointer_int = (int)pointer;
        printf("iteration %i with pointer at %i\n", i, pointer_int); //<- pointer_int is never below 0 or above 255
        //++bytesCount[(int)file[i]];
        ++bytesCount[pointer_int];
    }
    free(file);
}
Some extra info:
- Changing the extension of the image to .txt doesn't help.
- The code returns Segmentation fault exactly at iteration 1099 (the file I'm using is approx. 163 KB, so file[i] should accept accesses up to approx. file[163000]).
- For .txt files it works perfectly: it reads the bytes one by one and counts them as expected, regardless of file size.
- I'm on a Mac (you never know...)
//EDIT: I have edited the code into a more broken-down and explanatory version, because some of you were telling me things I've already tried.
//EDIT_2: OK guys, never mind. This version should work on any computer other than mine. I think the problem is with my terminal when passing arguments, but I just switched OS and it works.
Do check whether fopen() and calloc() are successful.
The format specifier to print a long is %ld, not %lu.
(int)file[i] is bad as an array index because converting char to int preserves its value (all values representable in char are representable in int), so if char is signed in your environment, a byte above 0x7F yields a negative index, causing an out-of-range access and invoking undefined behavior.
You should change ++bytesCount[(int)file[i]]; to ++bytesCount[(unsigned char)file[i]]; in order to prevent a negative index.
Also note that fseek() to SEEK_END may not be meaningfully supported for a binary stream (N1570 7.21.9.2 The fseek function), so it is better to read byte by byte using fgetc(), which avoids that undefined behavior and uses less memory.
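A minimal sketch of that fgetc()-based version (keeping the question's variable names; error handling kept to the essentials):
#include <stdio.h>

int main(int argc, char **argv)
{
    int bytesCount[256] = {0};
    long numTotalBytes = 0;
    int ch;
    FILE *f;

    if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL)
        return 1;

    while ((ch = fgetc(f)) != EOF) {  /* ch is 0..255 here: always a safe index */
        ++bytesCount[ch];
        ++numTotalBytes;
    }
    fclose(f);

    printf("numTotalBytes: %ld\n", numTotalBytes);
    return 0;
}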
MikeCAT just beat me to it. A bit more explanation follows, in case it helps.
To fix: change file to unsigned char *file and the increment to ++bytesCount[file[i]];.
Explanation: per this answer, a plain char may be signed or unsigned. In this case, I'm guessing it defaults to signed. That means any value >= 0x80 becomes a negative number. Such values are unlikely to be in your English-language text file, but are very likely to be in an image! The typecast to (int) keeps negatives negative. The code therefore indexes bytesCount with a negative number, leading to the segmentation fault.
It might be caused by this line:
++bytesCount[(int)file[i]];
bytesCount is an array of 256 ints. If the index computed from file[i] falls outside 0-255 (here it can be negative, because file[i] is a plain char that may be signed), you are accessing invalid memory, and that can cause a segmentation fault.

Why is EOF defined to be −1 when −1 cannot be represented in a char?

I'm learning C programming on a Raspberry Pi, but I found that my program never catches EOF successfully. I used char c = 0; printf("%d", c - 1); to test the char type, finding that it ranges from 0 to 255, like an unsigned type. But the EOF defined in stdio.h is (-1). So is the wrong cc package installed on my Pi? How can I fix it? If I changed the EOF value in stdio.h manually, would there be further problems?
What worries me is that when learning from the K&R book, there are examples that use code like while ((c=getchar())!=EOF). I followed that on my Ubuntu machine and it works fine. I just wonder whether this kind of idiom has been abandoned by modern C practice, or whether there is something conflicting on my Raspberry Pi?
Here is my code:
#include <stdio.h>

int main(void)
{
    char c;
    int i = 0;

    while ((c = getchar()) != EOF && i < 50) {
        putchar(c);
        i++;
    }
    if (c == EOF)
        printf("\nEOF got.\n");
    while ((c = getchar()) != EOF && i < 500) {
        printf("%d", c);
        i++;
    }
}
Even when I redirect the input to a file, it keeps printing 255 on the screen and the program never terminates.
Finally I found that I was wrong: in the K&R book, c is defined as an int, not a char. Problem solved.
You need to store the character read by fgetc(), getchar(), etc. in an int so you can catch the EOF. This is well-known and has always been the case everywhere. EOF must be distinguishable from all proper characters, so it was decided that functions like fgetc() return valid characters as non-negative values (even if char is signed). An end-of-file condition is signalled by -1, which is negative and thus cannot collide with any valid character fgetc() could return.
Do not edit the system headers and especially do not change the value of constants defined there. If you do that, you break these headers. Notice that even if you change the value of EOF in the headers, this won't change the value functions like fgetc() return on end-of-file or error, it just makes EOF have the wrong value.
Why is EOF defined to be −1 when −1 cannot be represented in a char?
Because EOF isn't a character but a state.
If I changed the EOF value in stdio.h manually, will there be further
problems?
Absolutely, since you would effectively break the header entirely. A header is not an actual function, just a set of prototypes and declarations for functions that are defined elsewhere. ABSOLUTELY DO NOT change system headers; you will never succeed in doing anything but breaking your code, your project, or worse.
On the subject of EOF: EOF is not a character, and thus cannot be represented in a character variable. To get around this, most programmers simply use an int value (signed by default) that can hold the -1 of EOF. The reason EOF can never be a character is that otherwise there would be one character indistinguishable from the end-of-file indicator.
int versus char.
fgetc() returns an int, not a char. The values returned are those of unsigned char plus EOF - typically 257 different values. So saving the result in a char, signed char, or unsigned char loses the ability to distinguish some of them.
Instead, save the fgetc() return value in an int. After testing for an EOF result, the value can be saved as a char if needed.
// char c;
int c;
...
while ((c = getchar()) != EOF && i < 50) {
    char ch = c;
    ...
Detail: "Why is EOF defined to be −1 when −1 cannot be represented in a char?" misleads. On systems where char is signed and EOF == -1, a char can have the value of EOF. Yet on such systems a char with the value -1 can represent a character too - the two overlap. So a char cannot distinctly represent every character and EOF. It is best to use an int to save the return value of fgetc().
... the fgetc function obtains that character as an unsigned char converted to an int and ...
If the end-of-file indicator for the stream is set, or if the stream is at end-of-file, ... and the fgetc function returns EOF. ... C11 §7.21.7.1 2-3

To copy files in binary mode,why it doesn't work when we read to and write from a character variable? [duplicate]

This question already has answers here:
copying the contents of a binary file
(4 answers)
Closed 9 years ago.
The following program is intended to make a copy of one .exe application file. But just one little thing determines whether it gives me a proper copy of the intended file RealPlayer.exe or a corrupted file.
What I do is read from the source file in binary mode and write to the new copy in the same mode. For this I use a variable ch. But if ch is of type char, I get a corrupted file of just a few bytes, while the original file is 26 MB. If I change the type of ch to int, the program works fine and gives me an exact copy of RealPlayer.exe, sized 26 MB. So let me ask two questions that arise from this premise. I would appreciate it if you could answer both parts:
1) Why does using type char for ch mess things up while int works? What is wrong with the char type? After all, shouldn't it read byte by byte from the original file (as char is one byte itself) and write it byte by byte to the new copy? After all, isn't that what the int type does, i.e. read 4 bytes from the original file and then write that to the copy file? Why the difference between the two?
2) Why is the copied file so small compared to the original if we use the char type for ch? Let's forget for a moment that the copy is corrupt to begin with and focus on the size. Why is the size so small if we copy character by character (or byte by byte), but the original size when we copy "integer by integer" (or 4 bytes by 4 bytes)?
A friend suggested I simply stop asking questions and use int because it works while char doesn't!! But I need to understand what's going on here, as I see a serious lapse in my understanding of this matter. Your detailed answers are much sought. Thanks.
#include <stdio.h>
#include <stdlib.h>

int main()
{
    char ch; //This is the cause of the problem
    //int ch; //This solves the problem
    FILE *fp, *tp;

    fp = fopen("D:\\RealPlayer.exe", "rb");
    tp = fopen("D:\\copy.exe", "wb");
    if (fp == NULL || tp == NULL)
    {
        printf("Error opening files");
        exit(-1);
    }
    while ((ch = getc(fp)) != EOF)
        putc(ch, tp);
    fclose(fp);
    fclose(tp);
}
The problem is in the termination condition for the loop. In particular, the type of the variable ch, combined with rules for implicit type conversions.
while((ch=getc(fp))!=EOF)
getc() returns an int - either a value in the range 0-255 (i.e. an unsigned char) or -1 (EOF).
You stuff the result into a char, then promote it back to int to do the comparison. Unpleasant things happen, such as sign extension.
Let's assume your compiler treats char as signed char (the standard gives it the choice).
You read a bit pattern of 0xff (255) from your binary file - that's -1 when expressed as a char. It gets promoted to int, giving you 0xffffffff, and compared with EOF (also -1, i.e. 0xffffffff). They match, and your program thinks it has found the end of file and obediently stops copying. Oops!
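If you want to see the sign extension by itself, here is a tiny sketch (it assumes a platform where plain char is signed and EOF is -1, as described above):
#include <stdio.h>

int main(void)
{
    char ch = 0xff;     /* byte from the file; the conversion is implementation-defined, usually -1 */
    int promoted = ch;  /* sign-extended to 0xffffffff on such platforms */

    printf("%08x\n", (unsigned)promoted);  /* prints ffffffff */
    printf("%s\n", promoted == EOF ? "looks like EOF" : "a normal byte");
    return 0;
}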
One other note - you wrote:
After all, isn't that what the int type does, i.e. read 4 bytes from the original file and then write that to the copy file?
That's incorrect. getc(fp) behaves the same regardless of what you do with the value returned - it reads exactly one byte from the file, if one is available, and returns that value as an int.
int getc(FILE *stream);
Returns the character currently pointed to by the internal file position indicator of the specified stream.
On success, the character read is returned (promoted to an int value). If you have defined ch as an int, everything works fine; but if ch is defined as a char, the value returned from getc() is truncated back to a char.
The above is what causes the corruption of the data and the loss in size.

fread only reading 64 bytes?

As the title says, fread appears to be reading only the first 64 characters. Relevant code:
FILE *sigD = fopen("signature", "r");
char *sig[255];
fread(sig, 255, 255, sigD);
fclose(sigD);
fputs(sig, stdout);
Console output:
user@PC:~$ ./a.out --has-sig
;2F*S|tr;;E9;Yb=R6)!fcXhoX#RC`#NzLy<}w#T+uvH${3Et&9K&-0~%D{1
user@PC:~$
user@PC:~$ cat signature
;2F*S|tr;;E9;Yb=R6)!fcXhoX#RC`#NzLy<}w#T+uvH${3Et&9K&-0~%D{1N{7ry:-B9b:kGB=Gkk9V+Cc$8a&35W{15Q~#-+PMeqa;#cKA7Ew3G6P4smDdJWV2#>R!V#ki#(Xj<a,^B)qJ5D&bON//?%/!G)XA&m|8:1mVHmx{7nQoRJ%v{(K:;JtX2hOm/dhVm9mnuDMSbQX55ouVnmECbA`/`!?=Mh0Ab^#vk*K*HG5$omu6716/Loh1Ht
h
As that log shows, there are 254 characters in the file, but only 64 are getting read.
EDIT: the problem wasn't with fread; I had accidentally written zero terminators into the file.
It is not clear whether this is related, but there seem to be a couple of problems:
char *sig[255];
fread(sig, 255, 255, sigD);
The call to fread is not consistent with the declaration. It should probably be the following (you want an array of char rather than an array of pointers to char); the size/nitems arguments passed to fread were also not correct:
char sig[255];
// initially this was 'sizeof(sig), 1', but for this file it makes
// more sense as the following (size = 1, nitems = 255):
fread(sig, 1, sizeof(sig), sigD);
And while it should not matter, you might try opening it with a mode of "rb" to force a binary open (the b for binary is supposed to be ignored on POSIX conforming systems).
Your definition of sig is incorrect. If you want an array of characters you must remove the asterisk. You have defined an array of character pointers. It should look like:
char sig[255];
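Putting both fixes together, a sketch of the corrected snippet (it uses fwrite() with the count fread() actually returned, so no terminating NUL is assumed - which also sidesteps the zero-terminator issue from the question's EDIT):
#include <stdio.h>

int main(void)
{
    FILE *sigD = fopen("signature", "rb");
    char sig[255];
    size_t n;

    if (sigD == NULL)
        return 1;

    n = fread(sig, 1, sizeof(sig), sigD);  /* size 1, up to 255 items */
    fclose(sigD);

    fwrite(sig, 1, n, stdout);  /* write exactly the n bytes that were read */
    return 0;
}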

Reading general file

I'm making a program that reads in a file from stdin, does something to it and sends it to stdout.
As it stands, I have a line in my program:
while((c = getchar()) != EOF){
where c is an int.
However, the problem is that I want to use this program on ELF executables, and it appears that the byte value that represents EOF for ASCII files must occur somewhere inside the executable, which results in the output being truncated (correct me if I'm wrong here - this is just my hypothesis).
What is an effective, general way to go about this? I could dig up documents on the ELF format and just check for whatever comes at the end. That would be useful, but I think it would be better if I could still apply this program to any kind of file.
You'll be fine - the EOF constant doesn't contain a valid ASCII value (it's typically -1).
For example, below is an excerpt from stdio.h on my system:
/* End of file character.
Some things throughout the library rely on this being -1. */
#ifndef EOF
# define EOF (-1)
#endif
You might want to go a bit lower level and use system calls like open(), close() and read(); that way you can do what you like with the input, as it is stored in your own buffer.
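A minimal sketch of that approach (POSIX calls; "input.bin" is a hypothetical file name):
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;
    int fd = open("input.bin", O_RDONLY);

    if (fd < 0)
        return 1;

    /* read() returns the byte count, 0 at end of file, -1 on error -  */
    /* there is no in-band EOF value, so any byte pattern passes through */
    while ((n = read(fd, buf, sizeof buf)) > 0)
        write(STDOUT_FILENO, buf, (size_t)n);

    close(fd);
    return 0;
}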
You are doing it correctly.
EOF is not a character. No byte in the stream will ever make c equal to EOF. If and when c does contain EOF, that particular value did not originate from the file itself, but from the underlying library/OS: EOF is a signal that the stream ended or that something went wrong.
Make sure c is an int, though.
Oh... and you might want to read from a stream under your control. In the absence of code to do otherwise, stdin is subject to "text translation", which might not be desirable when reading binary data.
FILE *mystream = fopen(filename, "rb");
if (mystream) {
    /* use fgetc() instead of getchar() */
    while ((c = fgetc(mystream)) != EOF) {
        /* ... */
    }
    fclose(mystream);
} else {
    /* error */
}
From the getchar(3) man page:
Character values are returned as an
unsigned char converted to an int.
This means a character value read via getchar can never be equal to the signed integer -1. This little program explains it:
#include <stdio.h>

int main(void)
{
    int a;
    unsigned char c = EOF;

    a = (int)c;
    //output: 000000ff - 000000ff - ffffffff
    printf("%08x - %08x - %08x\n", a, c, -1);
    return 0;
}
