C file reading incorrect number of chars - c

I have stumbled across a problem where I am attempting to read in a file which is, according to Windows, '87.1 kb' in size; calling ftell in my program returns '89282', effectively confirming what Windows is saying.
So why is every method to read chars from the file only returning 173 or 174 characters?
The file is a .GIF file renamed to .txt (and I am trying to build a program that can load the data fully as I am working on a program to download online images and need to run comparisons on them).
So far I have tried:
fgetc - This returns 173/174 chars.
fread - Same as above, this is with a string with 1024 or more spaces available.
fgets - Doesn't work (as it doesn't return how many characters it has read - characters which include nulls).
setvbuf - Disabling this with _IONBF, or even supplying a buffer of 1024 or more only means 173/174 is still returned.
fflush - This produced a 'result', although a negative one - it returned '2' chars instead of '173'.
I am utterly stumped as to why it isn't reading anything more than 173/174 chars. Is there something I need to compensate for or expect at the lower level? Some buffer I need to expand or some weird character I need to look out for?

Here's one thing to look at. Have a look at the file in a hex viewer and see if there's a CTRL-Z somewhere around that 173/174 offset.
Then check to see if you're opening it with the "r" mode.
If so, it may be that the Windows translation between text and binary is stopping your reading there because CTRL-Z is an EOF marker in text mode. If so, you can probably fix this with "rb" mode on the fopen.
Failing that, you need to post the smallest code segment that exhibits the problem behaviour. It may be obvious to some of us here but only usually if we can see the code :-)
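A minimal sketch of the binary-mode read suggested above, assuming the goal is simply to count how many bytes can be read from the file. The helper name count_bytes_binary and the 1024-byte buffer are illustrative choices, not anything from the question.

```c
#include <stdio.h>

/* Hypothetical helper: count the bytes readable from `path` when the file
   is opened in binary mode. Returns -1 if the file cannot be opened or a
   read error occurs. */
long count_bytes_binary(const char *path)
{
    FILE *fp = fopen(path, "rb");   /* "rb": no CTRL-Z or CR/LF translation */
    if (fp == NULL)
        return -1;

    unsigned char buf[1024];
    long total = 0;
    size_t n;

    /* fread reports how many bytes it read, so embedded NULs are counted. */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += (long)n;

    int failed = ferror(fp);
    fclose(fp);
    return failed ? -1 : total;
}
```

If the total matches what ftell reports with "rb" but not with "r", the text-mode translation is the likely culprit.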

Related

Copy file in C doesn't seem to work completely

For my programming course I have to make a program that copies a file.
This program asks for the following:
an input file in the command prompt
a name for the output file
The files required to copy are .WAV audio files. I tried this with an audio sample of 3 seconds.
The thing is that I do get a file back, but it is empty. I have added the fclose and fopen statements.
while((ch = fgetc(input)) != EOF)
{
    fputc(ch, output);
}
I hope someone can point out where I probably made some beginners mistake.
The little while loop you show should in principle work if all prerequisites are met:
The files could be opened.
If on a Microsoft operating system, the files were opened in binary mode (see below).
ch is an int.
In other words, all problems you have are outside this code.
Binary mode: The CR-LF issue
There is a post explaining possible reasons for using a carriage return/linefeed combination; in the end, it is the natural thing to do, given that with typewriters, and by association with teletypes, the two are distinct operations: You move the large lever on the carriage to rotate the platen roller or cylinder a specified number of degrees so that the next line would not print over the previous one; that's the aptly named line feed. Only then, with the same lever, you move the carriage so that the horizontal print position is at the beginning of the line. That's the aptly named carriage return. The order of events is only a technicality.
DOS C implementations tried to be smart: A C program ported from Unix might produce text with only newlines in it; the output routines would transparently add the carriage return so that it would follow the DOS conventions and print properly. Correspondingly, CR/LF combinations in an input file would be silently converted to only LF when read by the standard library implementations.
The DOS file convention also uses CTRL-Z (26) as an end-of-file marker. Again, this could be a useful hint to a printer that all data for the current job had been received.
Unfortunately, these conventions were made the default behavior, and today are typically a nuisance: Nobody sends plain text to a printer any longer (apart from the three people who will comment under this post that they still do that).
It is a nuisance because for files that are not plain text silent data changes are catastrophic and must be suppressed, with a b "flag" indicating "binary" data passed in the fopen mode argument: To faithfully read you must specify fopen(filename, "rb"), and in order to faithfully write you must specify fopen(filename, "wb").
Empty file !?
When I tried copying a wave file without the binary flags the data was changed in the described fashion, and the copy stopped before the first byte with the value 26 (CTRL-Z) in the source. In other words, while the copy was corrupt, it was not empty. By the way, all wave files start with the bytes RIFF, so that no CTRL-Z can be encountered in the first position.
There are a number of possibilities for an empty target file, the most likely of which:
You didn't emit or missed an error message regarding opening the files (does your editor keep a lock on the output?), and the program crashed silently when one of the file pointers was null. Note that error messages may fail to be printed when you make error output on standard out: That stream is buffered, and buffered output may be lost in a crash. By contrast, output to stderr is unbuffered exactly to prevent message loss.
You are looking at the wrong output file. This kind of error is surprisingly common. You could perform a sanity check by deleting the file you are looking at, or by printing something manually before you start copying.
Generally, check the return value of every operation (including your fputc!).
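The checks listed above can be put together in a short sketch. This assumes a plain byte-for-byte copy is all that is wanted; the function name copy_file and its return convention are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: copy src to dst byte for byte.
   Returns 0 on success, -1 on any failure. */
int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    if (in == NULL) {
        perror(src);              /* perror writes to stderr */
        return -1;
    }
    FILE *out = fopen(dst, "wb");
    if (out == NULL) {
        perror(dst);
        fclose(in);
        return -1;
    }

    int ch;                       /* int, not char, so EOF stays distinct */
    int status = 0;
    while ((ch = fgetc(in)) != EOF) {
        if (fputc(ch, out) == EOF) {
            status = -1;          /* write failed, stop copying */
            break;
        }
    }
    if (ferror(in))               /* loop ended on a read error, not EOF */
        status = -1;

    fclose(in);
    if (fclose(out) == EOF)       /* fclose flushes; failure means lost data */
        status = -1;
    return status;
}
```

A caller would typically pass the two file names from the command line and treat a non-zero return as a failed copy.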

Why is my File I/O in VSCode not working properly? [duplicate]

With the C standard library stdio.h, I read that to output ASCII/text data, one should use mode "w" and to output binary data, one should use "wb". But why the difference?
In either case, I'm just outputting a byte (char) array, right? And if I output a non-ASCII byte in ASCII mode, the program still outputs the correct byte.
Some operating systems - mostly named "windows" - don't guarantee that they will read and write ascii to files exactly the way you pass it in. So on windows they actually map \r\n to \n. This is fine and transparent when reading and writing ascii. But it would trash a stream of binary data. Basically just always give windows the 'b' flag if you want it to faithfully read and write data to files exactly the way you passed it in.
There are certain transformations that can take place when outputting in ASCII (e.g. outputting carriage-return + line-feed when the output character is a newline), depending on your platform. Such transformations will not take place when using binary format.
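One way to see the difference is to write a single '\n' in each mode and measure what lands on disk. This is only a sketch; the helper name newline_size_on_disk is made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: write the single character '\n' to `path` using the
   given fopen mode, then return the file's size in bytes (-1 on failure). */
long newline_size_on_disk(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    if (fp == NULL)
        return -1;
    if (fputc('\n', fp) == EOF) {
        fclose(fp);
        return -1;
    }
    if (fclose(fp) == EOF)
        return -1;

    /* Reopen in binary mode so the bytes are counted without translation. */
    fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    long size = 0;
    while (fgetc(fp) != EOF)
        size++;
    fclose(fp);
    return size;
}
```

With mode "wb" the result is 1 on any platform. With mode "w" it is 2 on Windows C runtimes, because the newline is stored as CR LF, and 1 on Unix-like systems, where the two modes behave the same.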

What really is EOF for binary files? Condition? Character?

I have managed this far with the knowledge that EOF is a special character inserted automatically at the end of a text file to indicate its end. But I now feel the need for some more clarification on this. I checked on Google and the Wikipedia page for EOF but they couldn't answer the following, and there are no exact Stack Overflow links for this either. So please help me on this:
My book says that binary mode files keep track of the end of file from the number of characters present in the directory entry of the file. (In contrast to text files which have a special EOF character to mark the end). So what is the story of EOF in context of binary files? I am confused because in the following program I successfully use !=EOF comparison while reading from an .exe file in binary mode:
#include <stdio.h>
#include <stdlib.h>
int main()
{
    int ch;
    FILE *fp1,*fp2;
    fp1=fopen("source.exe","rb");
    fp2=fopen("dest.exe","wb");
    if(fp1==NULL||fp2==NULL)
    {
        printf("Error opening files");
        exit(-1);
    }
    while((ch=getc(fp1))!=EOF)
        putc(ch,fp2);
    fclose(fp1);
    fclose(fp2);
}
Is EOF a special "character" at all? Or is it a condition as Wikipedia says, a condition where the computer knows when to return a particular value like -1 (EOF on my computer)? Example of such "condition" being when a character-reading function finishes reading all characters present, or when character/string I/O functions encounter an error in reading/writing?
Interestingly, the Stack Overflow tag for EOF blended both those definitions of the EOF. The tag for EOF said "In programming realm, EOF is a sequence of byte (or a chacracter) which indicates that there are no more contents after this.", while it also said in the "about" section that "End of file (commonly abbreviated EOF) is a condition in a computer operating system where no more data can be read from a data source. The data source is usually called a file or stream."
But I have a strong feeling EOF won't be a character as every other function seems to be returning it when it encounters an error during I/O.
It will be really nice of you if you can clear the matter for me.
The various EOF indicators that C provides to you do not necessarily have anything to do with how the file system marks the end of a file.
Most modern file systems know the length of a file because they record it somewhere, separately from the contents of the file. The routines that read the file keep track of where you are reading and they stop when you reach the end. The C library routines generate an EOF value to return to you; they are not returning a value that is actually in the file.
Note that the EOF returned by C library routines is not actually a character. The C library routines generally return an int, and that int is either a character value or an EOF. E.g., in one implementation, the characters might have values from 0 to 255, and EOF might have the value −1. When the library routine encountered the end of the file, it did not actually see a −1 character, because there is no such character. Instead, it was told by the underlying system routine that the end of file had been reached, and it responded by returning −1 to you.
Old and crude file systems might have a value in the file that marks the end of file. For various reasons, this is usually undesirable. In its simplest implementation, it makes it impossible to store arbitrary data in the file, because you cannot store the end-of-file marker as data. One could, however, have an implementation in which the raw data in the file contains something that indicates the end of file, but data is transformed when reading or writing so that arbitrary data can be stored. (E.g., by “quoting” the end-of-file marker.)
In certain cases, things like end-of-file markers also appear in streams. This is common when reading from the terminal (or a pseudo-terminal or terminal-like device). On Windows, pressing control-Z is an indication that the user is done entering input, and it is treated similarly to reaching an end-of-file. This does not mean that control-Z is an EOF. The software reading from the terminal sees control-Z, treats it as end-of-file, and returns end-of-file indications, which are likely different from control-Z. On Unix, control-D is commonly a similar sentinel marking the end of input.
This should clear it up nicely for you.
Basically, EOF is just a macro with a pre-defined value representing the error code from I/O functions indicating that there is no more data to be read.
The file doesn't actually contain an EOF. EOF isn't a character of sorts - remember a byte can be between 0 and 255, so it wouldn't make sense if a file could contain a -1. The EOF is a signal from the operating system that you're using, which indicates the end of the file has been reached. Notice how getc() returns an int - that is so it can return that -1 to tell you the stream has reached the end of the file.
The EOF signal is treated the same for binary and text files - the actual definition of binary and text stream varies between the OSes (for example on *nix binary and text mode are the same thing.) Either way, as stated above, it is not part of the file itself. The OS passes it to getc() to tell the program that the end of the stream has been reached.
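A small sketch of why the return value has to be kept in an int: a byte with the value 0xFF is ordinary data, and only the out-of-band EOF value ends the loop. The helper name count_until_eof and its hit_error parameter are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: count bytes in `path`, and report through *hit_error
   whether the loop ended because of a read error rather than end of file. */
long count_until_eof(const char *path, int *hit_error)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        *hit_error = 1;
        return 0;
    }

    long count = 0;
    int ch;                         /* int: an unsigned char value or EOF */
    while ((ch = getc(fp)) != EOF)
        count++;                    /* a 0xFF byte arrives as 255, not EOF */

    /* getc returns EOF in both cases; ferror tells them apart. */
    *hit_error = ferror(fp) ? 1 : 0;
    fclose(fp);
    return count;
}
```

Had ch been declared as a char, a 0xFF byte could compare equal to EOF on platforms where char is signed, and the loop would stop early.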
From the GNU C library:
This macro is an integer value that is returned by a number of narrow stream functions to indicate an end-of-file condition, or some other error situation. With the GNU C Library, EOF is -1. In other libraries, its value may be some other negative number.
EOF is not a character. In this context, it's -1, which, technically, isn't a character (if you wanted to be extremely precise, it could be argued that it could be a character, but that's irrelevant in this discussion). EOF, just to be clear is "End of File". While you're reading a file, you need to know when to stop, otherwise a number of things could happen depending on the environment if you try to read past the end of the file.
So, a macro was devised to signal that End of File has been reached in the course of reading a file, which is EOF. For getc this works because it returns an int rather than a char, so there's extra room to return something other than a char to signal EOF. Other I/O calls may signal EOF differently, such as by throwing an exception.
As a point of interest, in DOS (and maybe still on Windows?) an actual, physical character ^Z was placed at the end of a file to signal its end. So, on DOS, there actually was an EOF character. Unix never had such a thing.
Well it is pretty much possible to find the EOF of a binary file if you study its structure.
No, you don't need the OS to know the EOF of an executable.
Almost every type of executable has a Page Zero which describes the basic information that the OS might need while loading the code into the memory and is stored as the first page of that executable.
Let's take the example of an MZ executable.
https://wiki.osdev.org/MZ
Here at offset 2 we have the number of bytes used in the last page, and right after that at offset 4 we have the total number of complete/partial pages. This information is generally used by the OS to safely load the code into the memory, but you can use it to calculate the EOF of your binary file.
Algorithm:
1. Start
2. Parse the parameter and instantiate the file pointer as per your requirement.
3. Load the first page (zero) in a (char) buffer of default size of page zero and print it.
4. Get the value at *((short int*)(buffer+4)) and store it in a loop variable called (short int) i.
5. Get the value at *((short int*)(buffer+2)) and store it in a variable called (short int) l.
6. i--
7. Load and print (or do whatever you wanted to do) 'size of page' characters into a buffer until i equals zero.
8. Once the loop has finished executing just load `l` bytes into that buffer and again perform whatever you wanted to.
9. Stop
If you're designing your own binary file format then consider adding some sort of meta data at the start of that file or a special character or word that denotes the end of that file.
And there's a good amount of probability that the OS loads the size of the file from here with the help of simple maths and by analyzing the meta-data even though it might seem that the OS has stored it somewhere along with other information it's expected to store (Abstraction to reduce redundancy).
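The header arithmetic can be sketched as follows, assuming the MZ layout documented on the OSDev page linked above: a 16-bit little-endian "bytes in last page" field at offset 2, a 16-bit little-endian page count at offset 4, and 512-byte pages. The function names are hypothetical.

```c
#include <stdint.h>

#define MZ_PAGE_SIZE 512u   /* MZ pages are 512 bytes */

/* Read a 16-bit little-endian value from a byte buffer. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical helper: given at least the first 6 bytes of an MZ file,
   return the image size the header describes, or 0 if the signature is
   missing or the page count is zero. */
uint32_t mz_image_size(const unsigned char *header)
{
    if (header[0] != 'M' || header[1] != 'Z')
        return 0;

    uint16_t last_page_bytes = read_le16(header + 2);
    uint16_t pages           = read_le16(header + 4);
    if (pages == 0)
        return 0;

    /* A value of 0 at offset 2 means the last page is completely used. */
    if (last_page_bytes == 0)
        return (uint32_t)pages * MZ_PAGE_SIZE;
    return (uint32_t)(pages - 1) * MZ_PAGE_SIZE + last_page_bytes;
}
```

For a header beginning 'M', 'Z', 0x90, 0x00, 0x03, 0x00 this gives (3 - 1) * 512 + 144 = 1168 bytes. Note that the value describes the DOS load image only; files that carry further data after it, such as PE executables, are longer on disk than this figure.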

C File Input/Output for Unknown File Types: File Copying

having some issues with a networking assignment. End goal is to have a C program that grabs a file from a given URL via HTTP and writes it to a given filename. I've got it working fine for most text files, but I'm running into some issues, which I suspect all come from the same root cause.
Here's a quick version of the code I'm using to transfer the data from the network file descriptor to the output file descriptor:
unsigned long content_length; // extracted from HTTP header
unsigned long successfully_read = 0;
while(successfully_read != content_length)
{
char buffer[2048];
int extracted = read(connection,buffer,2048);
fprintf(output_file,buffer);
successfully_read += extracted;
}
As I said, this works fine for most text files (though the % symbol confuses fprintf, so it would be nice to have a way to deal with that). The problem is that it just hangs forever when I try to get non-text files (a .png is the basic test file I'm working with, but the program needs to be able to handle anything).
I've done some debugging and I know I'm not going over content_length, getting errors during read, or hitting some network bottleneck. I looked around online but all the C file i/o code I can find for binary files seems to be based on the idea that you know how the data inside the file is structured. I don't know how it's structured, and I don't really care; I just want to copy the contents of one file descriptor into another.
Can anyone point me towards some built-in file i/o functions that I can bludgeon into use for that purpose?
Edit: Alternately, is there a standard field in the HTTP header that would tell me how to handle whatever file I'm working with?
You are using the wrong tool for the job. fprintf takes a format string and extra arguments, like this:
fprintf(output_file, "hello %s, today is the %d", cstring, dayoftheweek);
If you pass the second argument from an unknown source (like the web, which you are doing) you can accidentally have %s or %d or other format specifiers in the string. Then fprintf will try to read more arguments than it was passed, and cause undefined behaviour.
Use fwrite for this:
fwrite(buffer, 1, extracted, output_file);
A couple things with your code:
For fprintf - you are using the data as the second argument, when in fact it should be the format, and the data should be the third argument. This is why you are getting problems with the % character, and why it is struggling when presented with binary data, because it is expecting a format string.
You need to use a different function, such as fwrite, to output the file.
As a side note this is a bit of a security problem - if you fetch a specially crafted file from the server it is possible to expose random areas of your memory.
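A minimal sketch of the fwrite approach with its return value checked. The helper name write_block is hypothetical; the point is that the byte count, not a terminating NUL or a format string, decides how much is written.

```c
#include <stdio.h>

/* Hypothetical helper: write exactly `len` bytes of `data` to `out`.
   Returns 0 on success, -1 if fwrite wrote fewer bytes than requested. */
int write_block(FILE *out, const char *data, size_t len)
{
    /* fwrite never interprets the bytes, so '%' and '\0' pass through. */
    size_t written = fwrite(data, 1, len, out);
    return written == len ? 0 : -1;
}
```

In the loop from the question this would be called as write_block(output_file, buffer, extracted) after confirming that extracted is greater than zero.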
In addition to Seth's answer: unless you are using a third-party library for handling all the HTTP stuff, you need to deal with the Transfer-Encoding header and the possible compression, or at least detect them and throw an error if you don't know how to handle that case.
In general, it may (or may not) be a good idea to parse the HTTP response headers, and only if they contain exclusively stuff that you understand should you continue to interpret the data that follows the header.
I bet your program is hanging because it's expecting X bytes but receiving Y instead, with X < Y (most likely, sans compression - but PNG don't compress well with gzip). You'll get chunks [*] of data, with one of the chunks most likely spanning content_length so your condition while(successfully_read != content_length) is always true.
You could try running your program under strace, or whatever its equivalent is for your OS, if you want to see how your program continues trying to read data that it will never get (because you've likely made an HTTP/1.1 request that holds the connection open, and you haven't made a second request) or from a connection that has ended (if the server closes the connection, your repeated calls to read(2) will just return 0, which leaves your still-true loop condition unchanged).
If you are sending your program's output to stdout, you may find that it produces no output - this can happen if the resource you are retrieving contains no newline or other flush-forcing control characters. Other stdio buffering regimes may apply when output goes to a file. (For example, the file will remain empty until the stdio buffers have accumulated at least 4096 bytes.)
[*] Then there's also Transfer-Encoding: chunked, as Roland Illig alludes to, which will ruin the exact equivalence between content_length (presumably derived from the eponymous HTTP header) and the actual number of bytes transferred over the socket.
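A sketch of a read loop that cannot spin forever, assuming a POSIX read() on the connection descriptor and an un-chunked body whose length is already known. The function name copy_body is hypothetical, and retrying after an interrupted read (EINTR) is deliberately left out.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: copy up to `content_length` bytes from the
   descriptor `fd` to `out`. Returns the number of bytes copied, which is
   less than content_length if the peer closed early or an error occurred. */
unsigned long copy_body(int fd, FILE *out, unsigned long content_length)
{
    char buffer[2048];
    unsigned long done = 0;

    while (done < content_length) {
        unsigned long want = content_length - done;
        if (want > sizeof buffer)
            want = sizeof buffer;       /* never read past the body */

        ssize_t got = read(fd, buffer, want);
        if (got <= 0)                   /* 0: connection closed, -1: error */
            break;
        if (fwrite(buffer, 1, (size_t)got, out) != (size_t)got)
            break;                      /* short write, give up */
        done += (unsigned long)got;
    }
    return done;
}
```

Comparing with < rather than != and capping each read at the remaining length means the counter can never step over content_length, and a return of 0 or -1 from read ends the loop instead of leaving the condition true forever.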
You are opening the file as a text file. On Windows, that means every \n you write is translated to \r\n, so the output grows. Try opening the file as binary, and those errors in size should go away.

Using fseek to backtrack

Is using fseek to backtrack character fscanf operations reliable?
Like for example if I have just fscanf-ed 10 characters but I would like to backtrack the 10 chars can I just fseek(infile, -10, SEEK_CUR) ?
For most situations it works but I seem to have problems with the character ^M. Apparently fseek registers it as a char but fscanf doesn't register it, thus in my previous example a 10 char block containing a ^M would require fseek(infile, -11, SEEK_CUR) instead. fseek(infile, -10, SEEK_CUR) would bring it up short by 1 character.
Why is this so?
Edit: I was using fopen in text mode
You're seeing the difference between a "text" and a "binary" file. When a file is opened in text mode (no 'b' in the fopen second argument), the stdio library may (indeed, must) interpret the contents of the file according to the operating system's conventions for text files. For example, in Windows, a line ends with \r\n, and this gets translated to a single \n by stdio, since that is the C convention. When writing to a text file, a single \n gets output as \r\n.
This makes it easier to write portable C programs that handle text files. Some details become complicated, however, and fseeking is one of them. Because of this, the C standard only defines fseek in text files in a few cases: to the very beginning, to the very end, to the current position, and to a previous position that has been retrieved with ftell. In other words, you can't compute a location to seek to for text files. Or you can, but you have to take care of all the platform-specific details yourself.
Alternatively, you can use binary files and do the line-ending transformations yourself. Again, portability suffers.
In your case, if you just want to go back to where you last did fscanf, the easiest would be to use ftell just before you fscanf.
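The ftell-then-fseek pattern can be sketched like this. The helper name peek_word and the 31-character limit are illustrative assumptions.

```c
#include <stdio.h>

/* Hypothetical helper: read one whitespace-delimited word (up to 31 chars)
   into `word`, then move the stream back to where it was before the read.
   Returns 0 on success, -1 on failure. */
int peek_word(FILE *fp, char word[32])
{
    long start = ftell(fp);             /* remember the position first */
    if (start == -1L)
        return -1;

    if (fscanf(fp, "%31s", word) != 1)
        return -1;

    /* Seeking to a value returned by ftell is defined even in text mode. */
    return fseek(fp, start, SEEK_SET) == 0 ? 0 : -1;
}
```

Because the position comes from ftell rather than from counting characters, it does not matter whether a ^M was swallowed along the way.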
This is because fseek works with bytes, whereas fscanf intelligently handles that the carriage return and line feed are two bytes, and swallows them as one char.
fseek has no understanding of the file's contents and just moves the file pointer 10 characters back.
fscanf, depending on the OS, may interpret newlines differently; it may even be that fscanf will insert the ^M if you're on DOS and the ^M does not appear in the file. Check the manual that came with your C compiler.
Just tried this with VS2008 and found that fscanf and fseek treated the CR and LF characters in the same way (as a single character).
So with two files:
0000000: 3132 3334 3558 3738 3930 3132 3334 3536 12345X7890123456
and
0000000: 3132 3334 350d 0a37 3839 3031 3233 3435 12345..789012345
If I read 15 characters I get to the second '5', then seek back 10 characters, my next character read is the 'X' in the first case and the CRLF in the second.
This seems like a very OS/compiler specific problem.
Did you test the return value of fscanf? Post some code.
Take a look at ungetc. You may have to run a loop over it.
