How is "file position" implemented in a stream (FILE)? - c

In Chapter 22 of the book "C Programming: A Modern Approach", the author devotes a brief section to the concept of file position. The following description is provided:
Every stream has an associated file position. When a file is opened, the file position is set at the beginning of the file. (If the file is opened in "append" mode, however, the initial file position may be at the beginning or end of the file, depending on the implementation.) Then, when a read or write operation is performed, the file position advances automatically, allowing us to move through the file in a sequential manner.
After this paragraph, the author dives into several <stdio.h> functions (e.g. fseek, ftell, etc), which are related to this notion of "file position".
I made a post recently (What is the difference between a pointer to a buffer and a pointer to a file?), and the provided answer / feedback gave me a decent beginner's understanding of what a stream, FILE, and FILE * actually are. Also revealed to me in this post was the fact that buffers can be created automatically ("by default") when fopen is invoked.
So my question is really a request: could someone explain to me, in some greater detail, what exactly file position is? Is it a pointer to the buffer related to fopen? If it's not a pointer to a buffer, does it somehow bear some sort of correspondence TO a pointer to a buffer? Presumably file position is stored inside FILE, etc.
Any insight is greatly appreciated! Cheers~

The file position is a number associated with the underlying file 'handle'. That handle would be a file descriptor on POSIX-like systems (strictly the 'open file description' as opposed to 'open file descriptor', but you can forget that distinction for the time being; see POSIX open() for more information). It would probably be a 'HANDLE' on Windows (but I reserve the right to be wrong on that). It doesn't matter too much as the FILE * abstraction isolates you, the programmer, from the low-level details.
The file position specifies an offset in bytes from the start of the file where activity (reading or writing) will occur. The position is changed by reading or writing data, or by seeking to a new position. The kernel (operating system) keeps track of the position, moving it when necessary. The structure pointed at by the file stream (FILE *) may also track the position in its data. That's because it has to ensure that changes to the buffer are properly reflected in the file, and changes in the file are properly reflected in the buffer. The buffer contains data associated with some range of positions in the file. That range changes as data is read or written, or as the program seeks on the file.
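To make this concrete, here is a minimal sketch of how the position can be observed through the standard API. The helper name position_after_reading is made up for illustration, and the sketch assumes a binary stream, where ftell reports a plain byte offset.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads up to `count` bytes from `fp` (discarding them)
 * and returns the stream's file position afterwards, or -1L on error. */
long position_after_reading(FILE *fp, size_t count)
{
    assert(fp != NULL);
    for (size_t i = 0; i < count; i++) {
        if (fgetc(fp) == EOF)
            break;          /* the position stops advancing at end-of-file */
    }
    return ftell(fp);       /* byte offset from the start (binary stream) */
}
```

Calling it on a freshly opened file with a count of 5 should report position 5. An fseek(fp, 2L, SEEK_SET) afterwards moves the position back to 2 without reading anything.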

Related

What does a file pointer point to in C?

I am trying to understand input and output files in C. In the beginning, when we want to open a file to read, we declare a file pointer as follows:
FILE *fptr1 = fopen("filename", "r");
I understand that FILE is a data structure in the stdio.h library and that it contains information about the file. I also know that the fopen() function returns a FILE structure. But is that the purpose of the pointer? Does it just point to a bunch of information about the file? I've been reading into this and I have heard the term "file streams" floating around a bit. I understand that it is an interface of communication with the file (I find it vague, but I'll take it). Is that what the pointer points to in simple terms: a file stream? In the above code example, would the pointer be pointing to an input file stream?
Thank you!
The FILE structure is intended to be opaque. In other words, you are not supposed to look into it if you want your programs to remain portable.
Further, FILE is always used through a pointer, so you don't even need to know its size.
In a way, you can consider it a void * for all intents and purposes.
Now, if you are really interested in what the FILE type may hold, the C standard itself explains it quite well! See C11 7.21.1p2:
(...) FILE which is an object type capable of recording all the information needed to control a stream, including its file position indicator, a pointer to its associated buffer (if any), an error indicator that records whether a read/write error has occurred, and an end-of-file indicator that records whether the end of the file has been reached; (...)
So as you see, at least it contains stuff like:
The position inside the file
A pointer to a buffer
Error flags
EOF flag
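Since FILE is opaque, these pieces are only reachable through library calls. A minimal sketch (the helper name drained_to_eof is hypothetical) that reads a stream to its end and then asks the stream what it recorded:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads `fp` until fgetc reports EOF, stores the final
 * file position in *final_pos, and returns 1 if reading stopped because of
 * end-of-file, or 0 if it stopped because of a read error. */
int drained_to_eof(FILE *fp, long *final_pos)
{
    assert(fp != NULL && final_pos != NULL);
    while (fgetc(fp) != EOF)
        ;                               /* each read advances the position */
    *final_pos = ftell(fp);             /* file position indicator */
    return feof(fp) && !ferror(fp);     /* EOF and error indicators */
}
```

The indicators stay set until something resets them, for example clearerr(fp) or a successful positioning call such as rewind(fp).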
It mentions (as you do) streams. You can find some more details about it in section 7.21.2 Streams:
Input and output, whether to or from physical devices such as terminals and tape drives, or whether to or from files supported on structured storage devices, are mapped into logical data streams, whose properties are more uniform than their various inputs and outputs. Two forms of mapping are supported, for text streams and for binary streams.
(...)
A binary stream is an ordered sequence of characters that can transparently record internal data. (...)
As we can read, a stream is an ordered sequence of characters. Note that it does not say whether this sequence is finite or not! (More on that later)
So, how do they relate to files? Let's see section 7.21.3 Files:
A stream is associated with an external file (which may be a physical device) by opening a file, which may involve creating a new file. Creating an existing file causes its former contents to be discarded, if necessary. If a file can support positioning requests (such as a disk file, as opposed to a terminal), then a file position indicator associated with the stream is positioned at the start (character number zero) of the file, unless the file is opened with append mode in which case it is implementation-defined whether the file position indicator is initially positioned at the beginning or the end of the file. The file position indicator is maintained by subsequent reads, writes, and positioning requests, to facilitate an orderly progression through the file.
(...)
See, when you open a "disk file" (the typical file in your computer), you are associating a "stream" (finite, in this case) which you can open/read/write/close/... through fread() and related functions; and the data structure that holds all the required information about it is FILE.
However, there are other kinds of files. Imagine a pseudo-random number generator. You can conceptualize it as an infinite read-only file: every time you read it gives you a different value and it never "ends". Therefore, this file would have an infinite stream associated with it. And some operations may not make sense with it (e.g. maybe you cannot seek it, i.e. move the file position indicator).
This only serves as a quick introduction, but as you can see, the FILE structure is an abstraction over the concept of a file. If you want to learn more about this kind of thing, the best you can do is reach for a good book on Operating Systems, e.g. Modern Operating Systems from Tanenbaum. This book also refers to C, so even better.

Text files edit C

I have a program which takes data(int,floats and strings) given by the user and writes it in a text file.Now I have to update a part of that written data.
For example:
At line 4 in file I want to change the first 2 words (there's an int and a float). How can I do that?
With the information I found out, fseek() and fputs() can be used but I don't know exactly how to get to a specific line.
(Explained code will be appreciated as I'm a starter in C)
You can't "insert" characters into a file. You will have to write a program that reads the whole file and writes a new one: first the part before the insertion point, then your edit, then the rest of the file.
You really need to read all the file, and ignore what is not needed.
fseek is not really useful here: it positions the file at some byte offset (relative to the start or the end of the file) and doesn't know about line boundaries.
Actually, lines inside a file are an ill-defined concept. Often a line is a sequence of bytes (different from the newline character) ended by a newline ('\n'). Some systems (notably Windows) treat text files specially (e.g. the real file contains \r\n to end each line, but the C library gives you the illusion that you have read \n).
In practice, you probably want to use line input routines, notably getline (or perhaps fgets).
If you use getline, you should take care to free the line buffer.
If your textual file has a very regular structure, you might fscanf the data (ignoring what you need to skip) without caring about line boundaries.
If you wanted to absolutely use fseek (which is a mistake), you would have to read the file twice: a first time to remember where each line starts (or ends) and a second time to fseek to the line start. Still, that does not work for updates, because you cannot insert bytes in the middle of a file.
And in practice, the most costly operation is the actual disk read. The cost of buffering (partly done by the kernel and <stdio.h> functions, and partly by you when you deal with lines) is negligible in comparison.
Of course, you cannot change a line of a file in place. If you need to do that, read the input file, produce an output file (containing the modified input), and rename it when finished.
BTW, you might be interested in indexed files like GDBM, or even in databases like SQLite, MariaDB, MongoDB, etc. You might also be interested in standard textual serialization formats like JSON or YAML (both have many libraries, even for C, to deal with them).
fseek() is used for random-access files where each record of data has the same size. Typically the data is binary, not text.
To solve your particular issue, you will need to read one line at a time to find the line you want to change. A simple solution is to copy the lines before it to a temporary file, write the changes to the same temporary file, then skip the parts of the original file that you want to change and copy the rest to the temporary file. Finally, close the original file, copy the temporary file to it, and delete the temporary file.
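The copy-and-replace approach described above can be sketched roughly as follows. The function name copy_replacing_line is made up for illustration, and it works on two already-open streams so that the caller decides which files are involved.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: copies `in` to `out`, writing `replacement` (which
 * should end in '\n') in place of line number `target` (1-based).
 * Returns 0 on success, -1 if the line was not found or an I/O call failed. */
int copy_replacing_line(FILE *in, FILE *out, long target, const char *replacement)
{
    assert(in != NULL && out != NULL && replacement != NULL);
    long line = 1;          /* number of the line the next character belongs to */
    int at_line_start = 1;  /* true until the first character of a line is seen */
    int replaced = 0;
    int c;

    while ((c = fgetc(in)) != EOF) {
        if (line == target) {
            /* Emit the replacement once, then drop the original characters. */
            if (at_line_start) {
                if (fputs(replacement, out) == EOF)
                    return -1;
                replaced = 1;
            }
        } else if (fputc(c, out) == EOF) {
            return -1;
        }
        at_line_start = 0;
        if (c == '\n') {
            line++;
            at_line_start = 1;
        }
    }
    return (replaced && !ferror(in)) ? 0 : -1;
}
```

A caller would open the original for reading and a temporary file for writing, call the helper, fclose both, and only then move the temporary file over the original (on POSIX systems rename() can do that in one step). Reading character by character avoids having to pick a maximum line length.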
With that said, I suggest that you learn more about random-access files. These are very useful when storing records all of the same size. If you have control over creating the original file, these might be better for your current purpose.
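As a rough sketch of what a random-access file looks like in practice: when every record has the same size, record number i starts at byte i * sizeof(record), so fseek can jump straight to it. The struct layout and the function names below are hypothetical, and the stream is assumed to be opened in a binary update mode such as "rb+" or "wb+".

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical fixed-size record: every record occupies sizeof(struct record) bytes. */
struct record {
    int   id;
    float value;
    char  name[24];
};

/* Overwrites record number `index` (0-based). Returns 0 on success, -1 on failure. */
int write_record(FILE *fp, long index, const struct record *rec)
{
    assert(fp != NULL && rec != NULL);
    if (fseek(fp, index * (long)sizeof *rec, SEEK_SET) != 0)
        return -1;
    return fwrite(rec, sizeof *rec, 1, fp) == 1 ? 0 : -1;
}

/* Reads record number `index` (0-based). Returns 0 on success, -1 on failure. */
int read_record(FILE *fp, long index, struct record *rec)
{
    assert(fp != NULL && rec != NULL);
    if (fseek(fp, index * (long)sizeof *rec, SEEK_SET) != 0)
        return -1;
    return fread(rec, sizeof *rec, 1, fp) == 1 ? 0 : -1;
}
```

Because an update overwrites exactly as many bytes as the old record occupied, nothing after it has to move. The trade-off is that a file of raw structs is tied to the compiler and machine that wrote it (padding, byte order), so it is not a portable exchange format.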

Copy sparse files

I'm trying to understand Linux (UNIX) low-level interfaces and, as an exercise, I want to write code that copies a file with holes into a new file (again with holes).
So my question is, how to read from the first file not till the first hole, but till the very end of the file?
If I'm not mistaken, read() returns 0 when it reaches the first hole (EOF).
I was thinking about seeking right byte by byte and trying to read this byte, but then I have to know the number of holes in advance.
If by holes you mean sparse files, then you have to find the holes in the input file and recreate them using lseek when writing the output file. Since Linux 3.1, you can even use lseek to jump to the beginning or end of a hole, as described in great detail in the man page.
As ThiefMaster already pointed out, normal file operations will treat holes simply as sequences of zero bytes, so you won't see the EOF you mention.
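A small experiment shows the "holes read as zeros" behaviour. This is only a sketch, assuming a POSIX-like system where seeking past the end of a file and then writing is allowed; it uses stdio rather than raw descriptors to stay short, and the helper name read_across_gap is made up.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes one byte `gap` bytes past the current end of
 * `fp`, then reads the whole file sequentially, counting all bytes read
 * (*total) and how many of them are zero (*zeros).
 * Returns 0 on success, -1 on failure. */
int read_across_gap(FILE *fp, long gap, long *total, long *zeros)
{
    assert(fp != NULL && total != NULL && zeros != NULL);
    if (fseek(fp, gap, SEEK_END) != 0)  /* move past the end of the file */
        return -1;
    if (fputc('x', fp) == EOF)          /* the skipped range is never written */
        return -1;
    rewind(fp);

    *total = 0;
    *zeros = 0;
    int c;
    while ((c = fgetc(fp)) != EOF) {    /* EOF only arrives at the real end */
        (*total)++;
        if (c == 0)
            (*zeros)++;
    }
    return ferror(fp) ? -1 : 0;
}
```

Whether the skipped range is actually stored as a hole depends on the filesystem, but either way the reader sees `gap` zero bytes and reaches EOF only after the final 'x'.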
For copies of sparse files, from the cp manual:
By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well. That is the behavior selected by --sparse=auto. Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes. Use --sparse=never to inhibit creation of sparse files.
Thus, try --sparse=always if you need to copy a sparse file 'as-is' (although the result still seems to depend on cp's zero-detection heuristic).
A file is not presented as if it has any gaps. If your intention is to say that the file has sections on one area of the disk, then more on another, etc., you are not going to be able to see this through a call to open() on that file and a series of read() calls. You would instead need to open() and read() the raw disk instead, seeking to sectors on your own.
If your meaning of "holes" in a file is as ThiefMaster says, just areas of 0 bytes -- these are only "holes" according to your application use of the data; to the file system they're just bytes in a file, no different than any other. In this case, you can copy it through a simple read of the data source and write to the data target, and you will get a full copy (along with what you're calling holes).

ftello/fseeko vs fgetpos/fsetpos

What is the difference between ftello/fseeko and fgetpos/fsetpos? Both seem to be file pointer getting/setting functions that use opaque offset types to sometimes allow 64 bit offsets.
Are they supported on different platforms or by different standards? Is one more flexible in the type of the offset it uses?
And, by the way, I am aware of what the difference is between fgetpos/fsetpos and ftell/fseek, but this is not a duplicate. That question asks about ftell/fseek, and the answer is not applicable to ftello/fseeko.
See Portable Positioning for detailed information on the difference. An excerpt:
On some systems where text streams truly differ from binary streams, it is impossible to represent the file position of a text stream as a count of characters from the beginning of the file. For example, the file position on some systems must encode both a record offset within the file, and a character offset within the record.
As a consequence, if you want your programs to be portable to these systems, you must observe certain rules:
The value returned from ftell on a text stream has no predictable relationship to the number of characters you have read so far. The only thing you can rely on is that you can use it subsequently as the offset argument to fseek or fseeko to move back to the same file position.
In a call to fseek or fseeko on a text stream, either the offset must be zero, or whence must be SEEK_SET and the offset must be the result of an earlier call to ftell on the same stream.
The value of the file position indicator of a text stream is undefined while there are characters that have been pushed back with ungetc that haven't been read or discarded. See Unreading.
In a nutshell: fgetpos/fsetpos use a more flexible structure to store additional metadata about the file position state, enabling greater portability (in theory).
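For illustration, here is a minimal sketch of the fgetpos/fsetpos style of use: the fpos_t value is treated as an opaque bookmark that is only ever handed back to fsetpos on the same stream. The helper name peek_line is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads the next line into `buf` without consuming it,
 * by bookmarking the position first and restoring it afterwards.
 * Returns 0 if a line was read, -1 at end-of-file or on error. */
int peek_line(FILE *fp, char *buf, int size)
{
    assert(fp != NULL && buf != NULL && size > 0);
    fpos_t mark;                        /* opaque: never inspect or do arithmetic on it */
    if (fgetpos(fp, &mark) != 0)
        return -1;
    char *got = fgets(buf, size, fp);
    if (fsetpos(fp, &mark) != 0)        /* go back to the bookmark */
        return -1;
    return got != NULL ? 0 : -1;
}
```

The same thing could be written with ftello/fseeko and an off_t, which has the advantage that the offset is an ordinary integer, at the cost of relying on POSIX rather than ISO C.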

where is the position indicator after executing fseek with a too large value

What happens if I call fseek like this:
fseek(fileptr,10000L,SEEK_CUR);
and it returns a nonzero value, as an indicator that it could not successfully execute, because the file was not big enough for the step.
At what position is the position indicator now? At the end of the file, or did it stay on the position it used to be, before execution?
The context:
I want to figure out a way of extracting the size of a file which has more bytes than a long can safely hold. Therefore I need to go step by step through the file and express the last few bytes as a fraction of a normal step.
E.g.: 2.5 * 100 = a size of 250 bytes with a step size of 100 bytes
fseek() won't fail because the file is "not big enough for the step". Provided fileptr is a valid stream, the position indicator will move to the previous position plus 10000L bytes, even if this is past the end of the file.
It is possible for fseek() to fail (for example, if the second argument wasn't one of the SEEK_xxx constants). In this case, the position indicator will remain unchanged.
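Both cases are easy to check for yourself. A minimal sketch (the helper name seek_and_report is made up) that performs a seek and reports where the stream ended up:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: calls fseek with the given arguments, stores the
 * resulting position in *pos, and returns whatever fseek returned. */
int seek_and_report(FILE *fp, long offset, int whence, long *pos)
{
    assert(fp != NULL && pos != NULL);
    int rc = fseek(fp, offset, whence);
    *pos = ftell(fp);   /* where the position indicator is now */
    return rc;          /* 0 on success, nonzero on failure */
}
```

On a small file opened in binary mode, seek_and_report(fp, 10000L, SEEK_CUR, &pos) from the start should return 0 and report a position of 10000 on POSIX-like systems. Passing a bogus whence value should return nonzero and leave the reported position where it was.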
After a failed seek, the file position should remain the same as whatever it was before.
I was going to suggest that you use lseek instead of your hack of moving ahead in the file in multiple steps, because lseek works with the off_t type, which should hopefully be a 64-bit type for you (see also #define _FILE_OFFSET_BITS). However, a quick check of some manpages reveals that there is also fseeko(), which would be handy for you since you're using FILE * streams, not raw file descriptors. Either way, the best way to seek right to the end of a file is to use one of those functions with SEEK_END.
EDIT: aix makes an important point which bears repeating: seeking past the end of the file does not make the seek fail.
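The SEEK_END suggestion can be sketched like this, assuming a POSIX system that provides fseeko/ftello. The helper name stream_size is made up, and the two feature-test macros must come before the first #include to have any effect.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fseeko/ftello */
#define _FILE_OFFSET_BITS 64     /* request a 64-bit off_t where it is optional */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: returns the size of the file behind `fp` in bytes,
 * or -1 on failure. The original file position is restored before returning. */
off_t stream_size(FILE *fp)
{
    assert(fp != NULL);
    off_t saved = ftello(fp);            /* remember where we were */
    if (saved < 0)
        return -1;
    if (fseeko(fp, 0, SEEK_END) != 0)    /* jump straight to the end */
        return -1;
    off_t size = ftello(fp);             /* offset of the end == size in bytes */
    if (fseeko(fp, saved, SEEK_SET) != 0)
        return -1;
    return size;
}
```

This is meant for binary streams on regular files; for a file that is never opened, stat() would answer the same question without touching the position at all.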

Resources