I'm trying to write a simple C program that counts how many times a byte is repeated in a file. We tried the code with .txt files and it works wonders (max size tested: 137MB). But when we tried it with an image (even a small one, 2KB), it returned Segmentation Fault 11.
I've done some research and found some libraries specific to images, but I don't want to resort to them since the code isn't only meant for images, but for virtually any type of file. Is there a way to simply read a file byte by byte regardless of anything else (extension, metadata, etc.)?
This is the code:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    FILE *f;
    char *file;
    long numTotalBytes = 0;
    int bytesCount[256] = {0};

    f = fopen(argv[1], "rb");
    fseek(f, 0L, SEEK_END);
    numTotalBytes = ftell(f);
    rewind(f);

    file = calloc(1, numTotalBytes);
    fread(file, numTotalBytes, 1, f);
    fclose(f);

    printf("numTotalBytes: %ld", numTotalBytes); //<- this gives the right output even for images

    unsigned int i;
    for (i = 0; i < numTotalBytes; ++i) {
        unsigned char pointer = file[i]; //<- This access fails at file[1099]
        int pointer_int = (int)pointer;
        printf("iteration %i with pointer at %i\n", i, pointer_int); //<- pointer_int is never below 0 or above 255
        //++bytesCount[(int)file[i]];
        ++bytesCount[pointer_int];
    }

    free(file);
}
Some extra info:
- Changing the extension of the image to .txt doesn't work.
- The code returns Segmentation Fault exactly at iteration 1099 (the file I'm using is approx. 163KB, so file[i] should accept accesses up to approx. file[163000]).
- For txt files it works perfectly. It reads the bytes one by one and counts them as expected, regardless of file size.
- I'm on Mac (you never know...)
//EDIT: I have edited the code into a more broken-down and explanatory version, because some of you were telling me things I've already tried.
//EDIT_2: Ok guys, never mind. This version should work on any computer other than mine. I think the problem is with my terminal when passing arguments, but I just switched OS and it works.
Do check if fopen() and calloc() are successful.
The format specifier to print long is %ld, not %lu.
(int)file[i] is a bad array index because converting char to int preserves its value, so if char is signed in your environment, any byte of 0x80 or above becomes a negative index, causing an out-of-range access and invoking undefined behavior.
You should change ++bytesCount[(int)file[i]]; to ++bytesCount[(unsigned char)file[i]]; in order to prevent using a negative index.
Also note that fseek() with SEEK_END may not be meaningfully supported for a binary stream (N1570 7.21.9.2 The fseek function), so it is better to read one byte at a time using fgetc() in order to avoid undefined behavior and to use less memory.
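For illustration, here is a minimal sketch of that fgetc()-based counter (my own sketch, not code from the question; argument and error checks are kept minimal):
#include <stdio.h>

int main(int argc, char **argv) {
    if (argc != 2) return 1;

    FILE *f = fopen(argv[1], "rb");
    if (f == NULL) return 1;            /* check fopen(), as noted above */

    long bytesCount[256] = {0};
    long numTotalBytes = 0;
    int c;                              /* int, so EOF can be told apart from a byte */

    while ((c = fgetc(f)) != EOF) {     /* c is always in the range 0..255 here */
        ++bytesCount[c];
        ++numTotalBytes;
    }
    fclose(f);

    printf("numTotalBytes: %ld\n", numTotalBytes);
    return 0;
}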
MikeCAT just beat me to it. A bit more explanation follows, in case it helps.
To fix: change file to unsigned char *file and the increment to ++bytesCount[file[i]];.
Explanation: per this answer, a plain char may be signed or unsigned. In this case, I'm guessing it defaults to signed. That means any value >= 0x80 will become a negative number. Such values are not likely to be in your English-language text file, but are very likely to be in an image! The typecast to (int) will keep negatives negative. Therefore, the code will index bytesCount with a negative number, leading to the segmentation fault.
It might be caused by this line
++bytesCount[(int)file[i]];
bytesCount is an array of 256 ints, so the valid indices are 0 to 255. If the index derived from file[i] falls outside that range (for example, because it is negative when char is signed), you are accessing invalid memory, and that can cause a segmentation fault.
Related
I would like to copy binary file source to file target. Nothing more! The code is inspired by many examples found on the Internet.
#include <stdio.h>

int main(int argc, char **argv) {
    FILE *fp1, *fp2;
    char ch;

    fp1 = fopen("source.pdf", "r");
    fp2 = fopen("target.pdf", "w");

    while ((ch = fgetc(fp1)) != EOF)
        fputc(ch, fp2);

    fclose(fp1);
    fclose(fp2);
    return 0;
}
The result differs in file size.
root#vm:/home/coder/test# ls -l
-rwxr-x--- 1 root root 14593 Feb 28 10:24 source.pdf
-rw-r--r-- 1 root root 159 Mar 1 20:19 target.pdf
Ok, so what's the problem?
I know that bytes with values of 0x80 and above can end up negative when stored in a char. See here.
This is confirmed when I use printf("%x\n", ch);, which roughly 50% of the time prints something like FFFFFFE1.
The solution to my issue would be to use int instead of char.
Examples found with char: example 1, example 2, example 3, example 4, ...
Examples found with int: example a, ...
I don't use fancy compiler options.
Why do virtually all code examples I found assign the return value of fgetc() to a char instead of an int, which would be more correct?
What am I missing?
ISO C mandates that fgetc() returns an int since it must be able to return every possible character in addition to an end-of-file indicator.
So code that places the return value into a char, and uses it to detect EOF, is generally plain wrong and should not be used.
Having said that, two of the examples you gave don't actually do that.
One of them uses fseek and ftell to get the number of bytes in the file and then uses that to control the read/write loop. That could be problematic, since the file can actually change in size after the size is retrieved, but that's a different problem from trying to force an int into a char.
The other uses feof immediately after the character is read to check if the end of file has been reached.
But you're correct in that the easiest way to do it is to simply use the return value correctly, something like:
int charInt;
while ((charInt = fgetc(inputHandle)) != EOF)
    doSomethingWith(charInt);
Well, the thing is, most of the code you saw is wrong. There are 3 types of char: signed, unsigned, and plain char. Now, if plain char is signed by default, then a character with decimal value 255 will be considered equal to -1 (EOF). This is not what you want. (Yes, decimal value 255 isn't representable in a signed char, but the conversion is implementation-defined, and on most implementations it stores the bit pattern 0xFF in the char.)
Secondly, if char is unsigned, then EOF gets stored as 0xFF, which is also wrong, and the comparison against EOF will never succeed. (EOF is -1, so storing it in an unsigned char converts it to 255, i.e. 0xFF.)
That's why int is used: it can hold the value of EOF correctly, and that is how you should use it.
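Putting that together, a minimal sketch of the copy loop with the return value handled correctly (binary mode added for safety on non-POSIX systems; file names taken from the question):
#include <stdio.h>

int main(void) {
    FILE *fp1 = fopen("source.pdf", "rb");    /* "rb"/"wb": binary mode, harmless on POSIX */
    FILE *fp2 = fopen("target.pdf", "wb");
    if (fp1 == NULL || fp2 == NULL)
        return 1;

    int ch;                                   /* int, so EOF (-1) is distinct from the byte 0xFF */
    while ((ch = fgetc(fp1)) != EOF)
        fputc(ch, fp2);

    fclose(fp1);
    fclose(fp2);
    return 0;
}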
When I read an unsigned int from a binary file, it only reads the correct value if the value in the file is fairly low - when I try to read a value over 150,000 it gives me something like 9000... It's weird. Smaller numbers work perfectly however, like 50,000...
unsigned int value;
file = fopen(filePath, "rb");
fseek(file, 0, 0);
fread(&value, sizeof(unsigned int), 1, file);
printf("Value from file: %i\n", value);
The binary file was created on the same computer & operating system that the program runs on... what am I missing? The binary files are created properly and most of them do return the correct value & work fine, only the ones with large numbers don't...
You got the right value but used the wrong format string to print it, yielding undefined behavior. %i is for signed integers. You need %u.
The above is definitely a bug, but if it's not the source of the problem at hand, it's also possible that you're using a legacy compiler on Windows and the writing code is failing to open the file in binary mode. In that case, the value may get corrupted if one of the bytes happens to be 0x0a, in which case the value read back would be wrong.
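For illustration, a small round-trip sketch (the file name value.bin is made up for this example; error handling is minimal):
#include <stdio.h>

int main(void) {
    unsigned int out = 150000, in = 0;

    FILE *f = fopen("value.bin", "wb");       /* binary mode matters on Windows */
    if (f == NULL) return 1;
    fwrite(&out, sizeof out, 1, f);
    fclose(f);

    f = fopen("value.bin", "rb");
    if (f == NULL) return 1;
    fread(&in, sizeof in, 1, f);
    fclose(f);

    printf("Value from file: %u\n", in);      /* %u for unsigned int, not %i */
    return 0;
}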
I'm trying to read 14-digit hexadecimal numbers from a file and then print them. My idea is to use a long long int, read the lines from the file with fscanf as if they were strings, and then turn each string into a hex number using atoll. The problem is I am getting a segfault on my fscanf line according to valgrind, and I have absolutely no idea why. Here is the code:
#include <stdio.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        printf("error argc!= 2\n");
        return 0;
    }
    char *fileName = argv[1];
    FILE *fp = fopen(fileName, "r");
    if (fp == NULL) {
        return 0;
    }

    long long int num;
    char *line;
    while (fscanf(fp, "%s", line) == 1) {
        num = atoll(line);
        printf("%x\n", num);
    }
    return 0;
}
Are you sure you want to read your numbers as character strings? Why not let scanf do the work for you?
long long int num;
while( fscanf(fp, "%llx", &num) == 1 ){ // read a long long int in hex
printf("%llx\n", num); // print a long long int in hex
}
BTW, note the ll length modifier on the %x conversion in printf - it says the integer value is of long long type.
Edit
Here is a simple example of two loops reading a 3-line input (with two, zero, and three numbers on consecutive lines), once with a 'hex int' format and once with a 'string' format:
http://ideone.com/ntzKEi
A call to rewind allows the second loop to read the same input data.
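In case the link goes stale, here is a sketch of what that example does (the input file name is an assumption for illustration):
#include <stdio.h>

int main(void) {
    FILE *fp = fopen("input.txt", "r");        /* file name assumed for illustration */
    if (fp == NULL) return 1;

    long long int num;
    while (fscanf(fp, "%llx", &num) == 1)      /* first loop: read the values as hex numbers */
        printf("hex: %llx\n", num);

    rewind(fp);                                /* go back to the start of the stream */

    char word[64];
    while (fscanf(fp, "%63s", word) == 1)      /* second loop: read whitespace-delimited strings */
        printf("string: %s\n", word);

    fclose(fp);
    return 0;
}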
That line variable is not initialized, so when fscanf() dereferences it you get undefined behavior.
You should use:
char line[1024];
while(fgets(line, sizeof line, fp) != NULL)
to do the reading.
If you're on C99, you might want to use uint64_t to hold the number, since that makes it clear that 14-digit hexadecimal numbers (4 * 14 = 56 bits) will fit.
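Combining those suggestions, a rough sketch (assuming one hex number per line; names are my own):
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>

int main(int argc, char **argv) {
    if (argc != 2) return 0;
    FILE *fp = fopen(argv[1], "r");
    if (fp == NULL) return 0;

    char line[1024];
    while (fgets(line, sizeof line, fp) != NULL) {
        uint64_t num = strtoull(line, NULL, 16);   /* parse the line as a hexadecimal number */
        printf("%" PRIx64 "\n", num);              /* PRIx64 is the matching printf conversion */
    }
    fclose(fp);
    return 0;
}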
The other answers are good, but I want to clarify the actual reason for the crash you are seeing. The problem is that:
fscanf(fp, "%s", line)
... essentially means "read a string from a file, and store it in the buffer pointed at by line". In this case, your line variable hasn't been initialised, so it doesn't point anywhere. Technically, this is undefined behavior; in practice, the result will often be that you write over some arbitrary location in your process's address space; furthermore, since it will often point at an illegal address, the operating system can detect and report it as a segmentation violation or similar, as you are indeed seeing.
Note that fscanf with a %s conversion will not necessarily read a whole line - it reads a string delimited by whitespace. It might skip lines if they are empty and it might read multiple strings from a single line. This might not matter if you know the precise format of the input file (and it always has one value per line, for instance).
Although it appears in that case that you can probably just use an appropriate modifier to read a hexadecimal number (fscanf(fp, "%llx", &num)), rather than read a string and try to do a conversion, there are various situations where you do need to read strings and especially whole lines. There are various solutions to that problem, depending on what platform you are on. If it's a GNU system (generally including Linux) and you don't care about portability, you could use the m modifier, and change line to &line:
fscanf(fp, "%ms", &line);
This passes a pointer to line to fscanf, rather than its value (which is uninitialised), and the m causes fscanf to allocate a buffer and store its address in line. You then should free the buffer when you are done with it. Check the Glibc manual for details. The nice thing about this approach is that you do not need to know the line length beforehand.
If you are not using a GNU system or you do care about portability, use fgets instead of fscanf - this is more direct and allows you to limit the length of the line read, meaning that you won't overflow a fixed buffer - just be aware that it will read a whole line at a time, unlike fscanf, as discussed above. You should declare line as a char array rather than a char * and choose a suitable size for it. (Note that you can also specify a "maximum field width" for fscanf, e.g. fscanf(fp, "%1000s", line), but you really might as well use fgets).
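For completeness, a sketch of the %ms variant (glibc / POSIX.1-2008 only; the file name is assumed for illustration):
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *fp = fopen("numbers.txt", "r");      /* file name assumed for illustration */
    if (fp == NULL) return 0;

    char *line = NULL;
    while (fscanf(fp, "%ms", &line) == 1) {    /* fscanf allocates the buffer for us */
        unsigned long long num = strtoull(line, NULL, 16);
        printf("%llx\n", num);
        free(line);                            /* release the buffer fscanf allocated */
        line = NULL;
    }
    fclose(fp);
    return 0;
}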
I need to keep track of an int number greater than 255 in a file. It is greater than the largest unsigned char, so using fputc seems unreliable (first question: is that always true?).
I could use fputs by converting the digits to characters, obtaining a string; but in the program I need the number as an int too!
So, the question in the title: what is the most efficient way to write that number? Is there any way to avoid the conversion to a string?
Keep in mind that the file should then be read by another process, where the number should become an int again.
Just write out the binary representation:
int fd;
...
int foo = 1234;
write (fd, &foo, sizeof(foo));
(and add error handling).
Or if you like FILE*
FILE *file;
...
int foo = 1234;
fwrite (&foo, sizeof(foo), 1, file);
(and add error handling).
Note that if your file is to be loaded on a different system, potentially with different endianness, you might want to ensure the byte order is fixed (e.g. most significant byte or least significant byte first). You can use htonl, htons, etc. for this if you want. If you know the architecture that loads the file is the same as the one saving it, there is no need for this.
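A sketch of the fixed-byte-order variant, assuming a POSIX system for htonl/ntohl (the file name is made up for illustration):
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>     /* htonl / ntohl on POSIX systems */

int main(void) {
    int foo = 1234;
    uint32_t wire = htonl((uint32_t)foo);      /* force big-endian ("network") byte order */

    FILE *file = fopen("number.bin", "wb");    /* file name assumed for illustration */
    if (file == NULL) return 1;
    fwrite(&wire, sizeof wire, 1, file);
    fclose(file);

    file = fopen("number.bin", "rb");
    if (file == NULL) return 1;
    fread(&wire, sizeof wire, 1, file);        /* error handling omitted for brevity */
    fclose(file);

    int back = (int)ntohl(wire);               /* convert back to host byte order */
    printf("%d\n", back);
    return 0;
}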
This question already has answers here: copying the contents of a binary file (4 answers). Closed 9 years ago.
The following program is intended to make a copy of one .exe application file. But just one little thing determines whether it indeed gives me a proper copy of the intended file RealPlayer.exe or gives me a corrupted file.
What I do is read from the source file in binary mode and write to the new copy in the same mode. For this I use a variable ch. But if ch is of type char, I get a corrupted file that has a size of a few bytes, while the original file is 26MB. But if I change the type of ch to int, the program works fine and gives me an exact copy of RealPlayer.exe sized 26MB. So let me ask two questions that arise from this premise. I would appreciate it if you can answer both parts:
1) Why does using type char for ch mess things up while int works? What is wrong with the char type? After all, shouldn't it read byte by byte from the original file (as char is one byte itself) and write it byte by byte to the new copy file? After all, isn't that what the int type does, i.e., read 4 bytes from the original file and then write them to the copy file? Why the difference between the two?
2) Why is the file so small compared to the original file if we use the char type for ch? Let's forget for a moment that the copied file is corrupt to begin with and focus on the size. Why is it that the size is so small if we copy character by character (or byte by byte), but is big (original size) when we copy "integer by integer" (or 4 bytes by 4 bytes)?
A friend suggested I simply stop asking questions and use int because it works while char doesn't!! But I need to understand what's going on here, as I see a serious gap in my understanding of this matter. Your detailed answers are much appreciated. Thanks.
#include <stdio.h>
#include <stdlib.h>

int main()
{
    char ch;  //This is the cause of problem
    //int ch; //This solves the problem
    FILE *fp, *tp;

    fp = fopen("D:\\RealPlayer.exe", "rb");
    tp = fopen("D:\\copy.exe", "wb");
    if (fp == NULL || tp == NULL)
    {
        printf("Error opening files");
        exit(-1);
    }

    while ((ch = getc(fp)) != EOF)
        putc(ch, tp);

    fclose(fp);
    fclose(tp);
}
The problem is in the termination condition for the loop. In particular, the type of the variable ch, combined with rules for implicit type conversions.
while((ch=getc(fp))!=EOF)
getc() returns int - either a value from 0-255 (i.e. a char) or -1 (EOF).
You stuff the result into a char, then promote it back to int to do the comparison. Unpleasant things happen, such as sign extension.
Let's assume your compiler treats "char" as "signed char" (the standard gives it a choice).
You read a bit pattern of 0xff (255) from your binary file - that's -1, expressed as a char. That gets promoted to int, giving you 0xffffffff, and compared with EOF (also -1, i.e. 0xffffffff). They match, and your program thinks it found the end of file, and obediently stops copying. Oops!
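To make that concrete, a tiny sketch (the conversion of 0xff into a plain char is implementation-defined, but on typical systems where char is signed it yields -1):
#include <stdio.h>

int main(void) {
    char ch = 0xff;                   /* the byte pattern read from the file */
    printf("%d\n", (int)ch);          /* typically prints -1 when char is signed */
    printf("%d\n", (int)ch == EOF);   /* typically prints 1 - the comparison that ends the copy early */
    return 0;
}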
One other note - you wrote:
After all, isn't that what the int type does, i.e., read 4 bytes from the original
file and then write them to the copy file?
That's incorrect. getc(fp) behaves the same regardless of what you do with the value returned - it reads exactly one byte from the file, if there's one available, and returns that value - as an int.
int getc ( FILE * stream );
Returns the character currently pointed to by the internal file position indicator of the specified stream.
On success, the character read is returned (promoted to an int value). If you have defined ch as an int, all works fine, but if ch is defined as a char, the value returned from getc() is truncated back to a char.
The above is what causes the corruption of the data and the loss in size.