What does a file pointer point to in C? - c

I am trying to understand input and output files in C. In the beginning, when we want to open a file to read, we declare a file pointer as follows:
FILE *fptr1 = fopen("filename", "r");
I understand that FILE is a data structure in the stdio.h library and that it contains information about the file. I also know that the fopen() function returns a FILE structure. But is that the purpose of the pointer? Does it just point to a bunch of information about the file? I've been reading into this and I have heard the term "file streams" floating around a bit. I understand that it is an interface of communication with the file (I find it vague, but I'll take it). Is that what the pointer points to in simple terms - a file stream? In the above code example, would the pointer be pointing to an input file stream?
Thank you!

The FILE structure is intended to be opaque. In other words, you are not supposed to look into it if you want your programs to remain portable.
Further, FILE is always used through a pointer, so you don't even need to know its size.
In a way, you can consider it a void * for all intents and purposes.
Now, if you are really interested in what the FILE type may hold, the C standard itself explains it quite well! See C11 7.21.1p2:
(...) FILE which is an object type capable of recording all the information needed to control a stream, including its file position indicator, a pointer to its associated buffer (if any), an error indicator that records whether a read/write error has occurred, and an end-of-file indicator that records whether the end of the file has been reached; (...)
So as you see, at least it contains stuff like:
The position inside the file
A pointer to a buffer
Error flags
EOF flag
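The items in that list can be observed without ever looking inside FILE, because stdio exposes each one through a function. Below is a minimal sketch of that idea, assuming tmpfile() succeeds on your system; the names stream_state, query_stream and demo_stream_state are made up for this illustration.

```c
#include <stdio.h>

/* Hypothetical summary of the state the standard says a FILE records. */
struct stream_state {
    long position;   /* file position indicator, read with ftell() */
    int  at_eof;     /* end-of-file indicator, read with feof()    */
    int  has_error;  /* error indicator, read with ferror()        */
};

/* Query a stream only through the public stdio accessors. */
static struct stream_state query_stream(FILE *fp)
{
    struct stream_state s;
    s.position  = ftell(fp);
    s.at_eof    = feof(fp) != 0;
    s.has_error = ferror(fp) != 0;
    return s;
}

/* Write three bytes to a temporary file, read to the end, report the state. */
static struct stream_state demo_stream_state(void)
{
    struct stream_state s = { -1L, 0, 0 };
    FILE *fp = tmpfile();            /* binary update stream ("wb+") */
    if (fp == NULL)
        return s;                    /* position stays -1 to signal failure */
    fputs("abc", fp);
    rewind(fp);                      /* back to offset 0, clears indicators */
    while (fgetc(fp) != EOF)
        ;                            /* consume the whole stream */
    s = query_stream(fp);
    fclose(fp);
    return s;
}
```

After the read loop, the position is at the end of the three bytes and the end-of-file indicator is set, while the error indicator stays clear. The buffer is the one item not queried here; setvbuf() is the standard way to influence it.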
It mentions (as you do) streams. You can find some more details about it in section 7.21.2 Streams:
Input and output, whether to or from physical devices such as terminals and tape drives, or whether to or from files supported on structured storage devices, are mapped into logical data streams, whose properties are more uniform than their various inputs and outputs. Two forms of mapping are supported, for text streams and for binary streams.
(...)
A binary stream is an ordered sequence of characters that can transparently record internal data. (...)
As we can read, a stream is an ordered sequence of characters. Note that it does not say whether this sequence is finite or not! (More on that later)
So, how do they relate to files? Let's see section 7.21.3 Files:
A stream is associated with an external file (which may be a physical device) by opening a file, which may involve creating a new file. Creating an existing file causes its former contents to be discarded, if necessary. If a file can support positioning requests (such as a disk file, as opposed to a terminal), then a file position indicator associated with the stream is positioned at the start (character number zero) of the file, unless the file is opened with append mode in which case it is implementation-defined whether the file position indicator is initially positioned at the beginning or the end of the file. The file position indicator is maintained by subsequent reads, writes, and positioning requests, to facilitate an orderly progression through the file.
(...)
See, when you open a "disk file" (the typical file in your computer), you are associating a "stream" (finite, in this case) which you can open/read/write/close/... through fread() and related functions; and the data structure that holds all the required information about it is FILE.
However, there are other kinds of files. Imagine a pseudo-random number generator. You can conceptualize it as an infinite read-only file: every time you read it gives you a different value and it never "ends". Therefore, this file would have an infinite stream associated with it. And some operations may not make sense with it (e.g. maybe you cannot seek it, i.e. move the file position indicator).
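A pipe is an easy way to see a stream that refuses positioning requests. The sketch below is POSIX-only (pipe() and fdopen() are not part of ISO C), and the function name pipe_is_seekable is invented for the example.

```c
#define _POSIX_C_SOURCE 200809L   /* for fdopen() under strict ISO modes */
#include <stdio.h>
#include <unistd.h>               /* pipe(), close(): POSIX, not ISO C */

/* Returns 1 if a stream on a pipe accepts fseek(), 0 if it refuses,
 * and -1 if the pipe or the stream could not be set up. */
static int pipe_is_seekable(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    FILE *in = fdopen(fds[0], "r");   /* wrap the read end in a FILE */
    if (in == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* A pipe has no notion of "offset 0", so this request should fail. */
    int seekable = (fseek(in, 0L, SEEK_SET) == 0);

    fclose(in);                       /* also closes fds[0] */
    close(fds[1]);
    return seekable;
}
```

On the POSIX systems I know of, fseek() fails here, while fgetc() and fread() on the same stream would work normally. The FILE interface is the same as for a disk file; only the set of operations that make sense differs.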
This only serves as a quick introduction, but as you can see, the FILE structure is an abstraction over the concept of a file. If you want to learn more about this kind of thing, the best you can do is reach for a good book on operating systems, e.g. Modern Operating Systems by Tanenbaum. This book also refers to C, so even better.

Related

Copy file in C doesn't seem to work completely

For my programming course I have to make a program that copies a file.
This program asks for the following:
an input file in the command prompt
a name for the output file
The files required to copy are .WAV audio files. I tried this with an audio sample of 3 seconds.
The thing is that I do get a file back, but it is empty. I have added the fclose and fopen statements.
while((ch = fgetc(input)) != EOF)
{
fputc(ch, output);
}
I hope someone can point out where I probably made some beginner's mistake.
The little while loop you show should in principle work if all prerequisites are met:
The files could be opened.
If on a Microsoft operating system, the files were opened in binary mode (see below).
ch is an int.
In other words, all problems you have are outside this code.
Binary mode: The CR-LF issue
There is a post explaining possible reasons for using a carriage return/linefeed combination; in the end, it is the natural thing to do, given that with typewriters, and by association with teletypes, the two are distinct operations: You move the large lever on the carriage to rotate the platen roller or cylinder a specified number of degrees so that the next line would not print over the previous one; that's the aptly named line feed. Only then, with the same lever, you move the carriage so that the horizontal print position is at the beginning of the line. That's the aptly named carriage return. The order of events is only a technicality.
DOS C implementations tried to be smart: A C program ported from Unix might produce text with only newlines in it; the output routines would transparently add the carriage return so that it would follow the DOS conventions and print properly. Correspondingly, CR/LF combinations in an input file would be silently converted to only LF when read by the standard library implementations.
The DOS file convention also uses CTRL-Z (26) as an end-of-file marker. Again, this could be a useful hint to a printer that all data for the current job had been received.
Unfortunately, these conventions were made the default behavior, and today are typically a nuisance: Nobody sends plain text to a printer any longer (apart from the three people who will comment under this post that they still do that).
It is a nuisance because for files that are not plain text silent data changes are catastrophic and must be suppressed, with a b "flag" indicating "binary" data passed in the fopen mode argument: To faithfully read you must specify fopen(filename, "rb"), and in order to faithfully write you must specify fopen(filename, "wb").
Empty file !?
When I tried copying a wave file without the binary flags, the data was changed in the described fashion, and the copy stopped before the first byte with the value 26 (CTRL-Z) in the source. In other words, while the copy was corrupt, it was not empty. By the way, all wave files start with the bytes RIFF, so that no CTRL-Z can be encountered in the first position.
There are a number of possibilities for an empty target file, the most likely of which:
You didn't emit or missed an error message regarding opening the files (does your editor keep a lock on the output?), and the program crashed silently when one of the file pointers was null. Note that error messages may fail to be printed when you write them to standard output: That stream is buffered, and buffered output may be lost in a crash. By contrast, output to stderr is unbuffered exactly to prevent message loss.
You are looking at the wrong output file. This kind of error is surprisingly common. You could perform a sanity check by deleting the file you are looking at, or by printing something manually before you start copying.
Generally, check the return value of every operation (including your fputc!).
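Putting the prerequisites and the checks together, a copy routine might look like the sketch below. It is one possible arrangement rather than the asker's program: copy_file is a made-up name, and the error handling is reduced to perror() on stderr plus a return code.

```c
#include <stdio.h>

/* Copy src to dst byte for byte. Returns 0 on success, -1 on any failure. */
static int copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");          /* "b": no CR-LF or CTRL-Z handling */
    if (in == NULL) {
        perror(src);                      /* perror() writes to stderr */
        return -1;
    }
    FILE *out = fopen(dst, "wb");
    if (out == NULL) {
        perror(dst);
        fclose(in);
        return -1;
    }

    int ok = 1;
    int ch;                               /* int, so EOF differs from byte 0xFF */
    while ((ch = fgetc(in)) != EOF) {
        if (fputc(ch, out) == EOF) {      /* write failed, e.g. disk full */
            ok = 0;
            break;
        }
    }
    if (ferror(in))                       /* the loop ended on an error, not EOF */
        ok = 0;

    fclose(in);
    if (fclose(out) != 0)                 /* buffered data is flushed here */
        ok = 0;
    return ok ? 0 : -1;
}
```

Checking fclose() on the output matters because the last buffered bytes are only written at that point, so a write failure can surface there rather than in fputc().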

How is "file position" implemented in a stream (FILE)?

In Chapter 22 of the book "C Programming: A Modern Approach", the author devotes a brief section to the concept of file position. The following description is provided:
Every stream has an associated file position. When a file is opened, the file position is set at the beginning of the file. (If the file is opened in "append" mode, however, the initial file position may be at the beginning or end of the file, depending on the implementation.) Then, when a read or write operation is performed, the file position advances automatically, allowing us to move through the file in a sequential manner.
After this paragraph, the author dives into several <stdio.h> functions (e.g. fseek, ftell, etc), which are related to this notion of "file position".
I made a post recently (What is the difference between a pointer to a buffer and a pointer to a file?), and the provided answer / feedback gave me a decent beginner's understanding of what a stream, FILE, and FILE * actually are. Also revealed to me in this post was the fact that buffers can be automatically ("by default") created when fopen is invoked.
So my question is really a request: could someone explain to me, in some greater detail, what exactly file position is? Is it a pointer to the buffer related to fopen? If it's not a pointer to a buffer, does it somehow bear some sort of correspondence TO a pointer to a buffer? Presumably file position is stored inside FILE. etc etc.
Any insight is greatly appreciated! Cheers~
The file position is a number associated with the underlying file 'handle'. That handle would be a file descriptor on POSIX-like systems (strictly the 'open file description' as opposed to 'open file descriptor', but you can forget that distinction for the time being; see POSIX open() for more information). It would probably be a 'HANDLE' on Windows (but I reserve the right to be wrong on that). It doesn't matter too much as the FILE * abstraction isolates you, the programmer, from the low-level details.
The file position specifies an offset in bytes from the start of the file where activity (reading or writing) will occur. The position is changed by reading or writing data, or by seeking to a new position. The kernel (operating system) keeps track of the position, moving it when necessary. The structure pointed at by the file stream (FILE *) may also track the position in its data. That's because it has to ensure that changes to the buffer are properly reflected in the file, and changes in the file are properly reflected in the buffer. The buffer contains data associated with some range of positions in the file. That range changes as data is read or written, or as the program seeks on the file.
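To watch the position move, you can ask for it with ftell() between operations. The following is a small sketch on a binary temporary file; trace_positions is a hypothetical helper, and it assumes tmpfile() is usable.

```c
#include <stdio.h>

/* Records ftell() at four points: after rewinding, after reading 4 bytes,
 * after seeking to offset 2, and after reading one more byte.
 * Returns the byte read last, or EOF if anything failed. */
static int trace_positions(long pos[4])
{
    FILE *fp = tmpfile();                 /* binary update stream */
    if (fp == NULL)
        return EOF;

    int last = EOF;
    char buf[4];
    if (fputs("0123456789", fp) != EOF) {
        rewind(fp);
        pos[0] = ftell(fp);               /* start of the file */
        if (fread(buf, 1, sizeof buf, fp) == sizeof buf) {
            pos[1] = ftell(fp);           /* advanced by the 4 bytes read */
            if (fseek(fp, 2L, SEEK_SET) == 0) {
                pos[2] = ftell(fp);       /* moved explicitly to offset 2 */
                last = fgetc(fp);         /* reads the byte at offset 2 */
                pos[3] = ftell(fp);       /* advanced by one more byte */
            }
        }
    }
    fclose(fp);
    return last;
}
```

Note that ftell() reports the logical position of the stream. The library may already have read further ahead into its buffer, so the offset the kernel holds for the underlying descriptor can be larger at the same moment.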

Use fopen to open file repeatedly in C

I have a question about "fopen" function.
FILE *pFile1, *pFile2;
pFile1 = fopen(fileName,"rb+");
pFile2 = fopen(fileName,"rb+");
Can I say that pFile1==pFile2? Besides, can FILE type be used as a key of map?
Thanks!
Can I say that pFile1 == pFile2?
No. pFile1 and pFile2 are pointers to two distinct FILE structures, returned by the two different function calls.
Give it a try!!
To add further:
Note opening a file that is already open has implementation-defined behavior, according to the C Standard:
FIO31-C. Do not open a file that is already open
subclause 7.21.3, paragraph 8 [ISO/IEC 9899:2011]:
Functions that open additional (nontemporary) files require a file
name, which is a string. The rules for composing valid file names are
implementation-defined. Whether the same file can be simultaneously
open multiple times is also implementation-defined.
Some platforms may forbid a file simultaneously being opened multiple times, but other platforms may allow it. Therefore, portable code cannot depend on what will happen if this rule is violated. This isn't a problem on POSIX-compliant systems, though: many applications open a file multiple times to read concurrently (of course, if you also want to write, you may need a concurrency control mechanism, but that's a different matter).
Can I say that pFile1==pFile2?
(edited after reading the pertinent remark of Grijesh Chauhan)
you can say that pFile1 != pFile2, because 2 things can happen:
the system forbids opening the file twice, in which case pFile2 will be NULL
the system allows a second opening, in which case pFile2 will point to a different context.
This is one more reason among thousands to check system calls, by the way.
Assuming the second call succeeded, you can, for instance, seek to a given position with pFile1 while you read from another with pFile2.
As a side note, since you will eventually access the same physical disk, it is rarely a good idea to do so unless you know exactly what you're doing. Seeking back and forth like crazy between two different parts of a big file could eventually force the disk driver to wobble between two physical parts of the disk, reducing your I/O performance dramatically (unless the disk is a non-seeking device like an SSD).
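Both outcomes can be handled in a few lines. This is only a sketch: open_twice is an invented name, and it opens the file with "rb" rather than the question's "rb+" so that nothing gets modified.

```c
#include <stdio.h>

/* Opens path twice for reading and seeks on the first stream only.
 * Returns  1 if both opens succeeded and the pointers differ,
 *          0 if the system refused the second open,
 *         -1 on any other failure.
 * On a return of 1, *pos_a and *pos_b hold the two file positions. */
static int open_twice(const char *path, long *pos_a, long *pos_b)
{
    FILE *a = fopen(path, "rb");
    if (a == NULL)
        return -1;

    FILE *b = fopen(path, "rb");      /* allowed to fail: implementation-defined */
    if (b == NULL) {
        fclose(a);
        return 0;
    }

    int result = -1;
    if (a != b && fseek(a, 5L, SEEK_SET) == 0) {
        *pos_a = ftell(a);            /* moved by the seek */
        *pos_b = ftell(b);            /* untouched by the seek on a */
        result = 1;
    }
    fclose(a);
    fclose(b);
    return result;
}
```

Where the second open is permitted, each stream keeps its own buffer and its own position, which is why seeking on one does not move the other.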
can FILE type be used as a key of map?
No, because
it would not make any sense to use an unknown structure of an unknown size whose lifetime you have no direct control of as a key
the FILE type does not provide the necessary comparison operator
You could use a FILE *, though, since any pointer can be used as a map key.
However, it is pretty dangerous to do so. For one thing, the pointer is just like a random number to you. It comes from some memory allocation within the stdio library, and you have no control over it.
Second, if for some reason you deallocate the file handle (i.e. you close the file), you will keep using an invalid pointer reference as a key unless you also remove the file from the map. This is doable, but both awkward and dangerous IMHO.

Copy sparse files

I'm trying to understand Linux (UNIX) low-level interfaces and as an exercise want to write code that will copy a file with holes into a new file (again with holes).
So my question is, how to read from the first file not till the first hole, but till the very end of the file?
If I'm not mistaken, read() returns 0 when it reaches the first hole (EOF).
I was thinking about seeking right byte by byte and trying to read this byte, but then I have to know the number of holes in advance.
If by holes you mean sparse files, then you have to find the holes in the input file and recreate them using lseek when writing the output file. Since Linux 3.1, you can even use lseek to jump to the beginning or end of a hole, as described in great detail in the man page.
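The lseek approach could be sketched roughly as follows. This assumes Linux with glibc, where SEEK_DATA and SEEK_HOLE need _GNU_SOURCE; the function names copy_range and sparse_copy are made up, and a real tool would want finer error reporting.

```c
#define _GNU_SOURCE               /* exposes SEEK_DATA / SEEK_HOLE on glibc */
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy len bytes at offset off from in to the same offset in out. */
static int copy_range(int in, int out, off_t off, off_t len)
{
    char buf[65536];
    if (lseek(in, off, SEEK_SET) < 0 || lseek(out, off, SEEK_SET) < 0)
        return -1;
    while (len > 0) {
        size_t want = len < (off_t)sizeof buf ? (size_t)len : sizeof buf;
        ssize_t got = read(in, buf, want);
        if (got <= 0)
            return -1;                        /* error or unexpected EOF */
        for (ssize_t done = 0; done < got; ) {
            ssize_t w = write(out, buf + done, (size_t)(got - done));
            if (w <= 0)
                return -1;
            done += w;                        /* handle short writes */
        }
        len -= got;
    }
    return 0;
}

/* Copy src to dst, writing only the data regions so dst can stay sparse. */
static int sparse_copy(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }

    int rc = -1;
    struct stat st;
    if (fstat(in, &st) == 0) {
        off_t pos = 0;
        rc = 0;
        while (rc == 0 && pos < st.st_size) {
            off_t data = lseek(in, pos, SEEK_DATA);   /* next data region */
            if (data < 0) {
                if (errno != ENXIO)                   /* ENXIO: only a hole is left */
                    rc = -1;
                break;
            }
            off_t hole = lseek(in, data, SEEK_HOLE);  /* end of that region */
            if (hole < 0) {
                rc = -1;
                break;
            }
            rc = copy_range(in, out, data, hole - data);
            pos = hole;
        }
        /* Set the final size so that a trailing hole is kept as well. */
        if (rc == 0 && ftruncate(out, st.st_size) != 0)
            rc = -1;
    }
    close(in);
    if (close(out) != 0)
        rc = -1;
    return rc;
}
```

Seeking the output past the skipped ranges is what leaves holes in the copy. On a file system that does not track holes, the kernel reports the whole file as a single data region, so the routine falls back to an ordinary full copy.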
As ThiefMaster already pointed out, normal file operations will treat holes simply as sequences of zero bytes, so you won't see the EOF you mention.
For copies of sparse files, from the cp manual:
By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well. That is the behavior selected by --sparse=auto. Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes. Use --sparse=never to inhibit creation of sparse files.
Thus, try --sparse=always if you need to copy a sparse file 'as-is' (still seems affected by an algo)
A file is not presented as if it has any gaps. If your intention is to say that the file has sections on one area of the disk, then more on another, etc., you are not going to be able to see this through a call to open() on that file and a series of read() calls. You would instead need to open() and read() the raw disk, seeking to sectors on your own.
If your meaning of "holes" in a file is as @ThiefMaster says, just areas of 0 bytes -- these are only "holes" according to your application use of the data; to the file system they're just bytes in a file, no different than any other. In this case, you can copy it through a simple read of the data source and write to the data target, and you will get a full copy (along with what you're calling holes).

ftello/fseeko vs fgetpos/fsetpos

What is the difference between ftello/fseeko and fgetpos/fsetpos? Both seem to be file pointer getting/setting functions that use opaque offset types to sometimes allow 64 bit offsets.
Are they supported on different platforms or by different standards? Is one more flexible in the type of the offset it uses?
And, by the way, I am aware of what the difference is between fgetpos/fsetpos and ftell/fseek, but this is not a duplicate. That question asks about ftell/fseek, and the answer is not applicable to ftello/fseeko.
See Portable Positioning for detailed information on the difference. An excerpt:
On some systems where text streams truly differ from binary streams, it is impossible to represent the file position of a text stream as a count of characters from the beginning of the file. For example, the file position on some systems must encode both a record offset within the file, and a character offset within the record.
As a consequence, if you want your programs to be portable to these systems, you must observe certain rules:
The value returned from ftell on a text stream has no predictable relationship to the number of characters you have read so far. The only thing you can rely on is that you can use it subsequently as the offset argument to fseek or fseeko to move back to the same file position.
In a call to fseek or fseeko on a text stream, either the offset must be zero, or whence must be SEEK_SET and the offset must be the result of an earlier call to ftell on the same stream.
The value of the file position indicator of a text stream is undefined while there are characters that have been pushed back with ungetc that haven't been read or discarded. See Unreading.
In a nutshell: fgetpos/fsetpos use a more flexible structure to store additional metadata about the file position state, enabling greater portability (in theory).
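In practice the fgetpos/fsetpos pair is used as a bookmark: you save a position into an fpos_t and later hand the same object back, without ever inspecting it. Here is a minimal sketch in ISO C, with a hypothetical function name and assuming tmpfile() works.

```c
#include <stdio.h>

/* Bookmarks the position after the first byte, reads two more bytes,
 * jumps back to the bookmark and returns the byte found there.
 * Returns EOF if any step fails. */
static int reread_with_fsetpos(void)
{
    FILE *fp = tmpfile();
    if (fp == NULL)
        return EOF;

    int result = EOF;
    fpos_t mark;                          /* opaque: do not do arithmetic on it */
    if (fputs("abcdef", fp) != EOF) {
        rewind(fp);
        fgetc(fp);                        /* consume 'a' */
        if (fgetpos(fp, &mark) == 0) {    /* bookmark sits just before 'b' */
            fgetc(fp);                    /* consume 'b' */
            fgetc(fp);                    /* consume 'c' */
            if (fsetpos(fp, &mark) == 0)  /* return to the bookmark */
                result = fgetc(fp);       /* reads 'b' again */
        }
    }
    fclose(fp);
    return result;
}
```

By contrast, ftello() returns an off_t, which is an arithmetic type you can add to or compare. That is convenient on binary streams, but it is a POSIX interface rather than ISO C, and it cannot carry the extra state an fpos_t may hold.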

Resources