What is the difference between ftello/fseeko and fgetpos/fsetpos? Both seem to be file pointer getting/setting functions that use opaque offset types to sometimes allow 64 bit offsets.
Are they supported on different platforms or by different standards? Is one more flexible in the type of the offset it uses?
And, by the way, I am aware of the question about the difference between fgetpos/fsetpos and ftell/fseek, but this is not a duplicate. That question asks about ftell/fseek, and the answer is not applicable to ftello/fseeko.
See Portable Positioning for detailed information on the difference. An excerpt:
On some systems where text streams truly differ from binary streams, it is impossible to represent the file position of a text stream as a count of characters from the beginning of the file. For example, the file position on some systems must encode both a record offset within the file, and a character offset within the record.
As a consequence, if you want your programs to be portable to these systems, you must observe certain rules:
The value returned from ftell on a text stream has no predictable relationship to the number of characters you have read so far. The only thing you can rely on is that you can use it subsequently as the offset argument to fseek or fseeko to move back to the same file position.
In a call to fseek or fseeko on a text stream, either the offset must be zero, or whence must be SEEK_SET and the offset must be the result of an earlier call to ftell on the same stream.
The value of the file position indicator of a text stream is undefined while there are characters that have been pushed back with ungetc that haven't been read or discarded. See Unreading.
In a nutshell: fgetpos/fsetpos use a more flexible structure to store additional metadata about the file position state, enabling greater portability (in theory).
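As a minimal sketch of the save-and-restore pattern that fgetpos/fsetpos are designed for, the helper below looks at the next character and then puts the stream back where it was. The name peek_char is hypothetical, chosen only for this illustration; the fpos_t value is treated as opaque and is never inspected or modified.

```c
#include <stdio.h>

/* Reads one character at the current position, then restores the stream
 * to where it was before the read. Returns the character, or EOF on
 * failure or end of file. peek_char is a hypothetical helper name. */
int peek_char(FILE *fp)
{
    fpos_t saved;                  /* opaque: may hold more than a byte offset */

    if (fgetpos(fp, &saved) != 0)
        return EOF;

    int c = fgetc(fp);

    if (fsetpos(fp, &saved) != 0)  /* go back to the saved position */
        return EOF;
    return c;
}
```

Calling peek_char twice in a row returns the same character, because the position is restored after each read.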
I'm a beginner in C programming and I have some questions regarding how to deal with files.
Let us suppose that we have a binary file with N int values stored.
Let us suppose that we want to read the i-th int value in the file.
Is there any real advantage of using fseek for positioning the file pointer to the i-th int value and reading it after the fseek instead of using a sequence of i fread calls?
Intuitively, I think that fseek is faster. But how does the function find the i-th value in the file without reading the intermediate information?
I think that this is implementation-dependent. So, I tried to find the implementation of fseek function, without much success.
But how does the function find the i-th value in the file without reading the intermediate information?
It doesn't. It's up to you to provide the correct (absolute or relative) offset. You can request, for example, to advance the file pointer by i*sizeof(X).
It still needs to follow the chain of sectors in which the file is located to find the right one, but that doesn't require reading those sectors. That metadata is stored outside of the file itself.
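A minimal sketch of that idea, assuming the file was opened in binary mode, holds raw int values written on the same platform, and that i * sizeof(int) fits in a long. The name read_nth_int is hypothetical.

```c
#include <stdio.h>

/* Reads the int at zero-based index i from a binary stream into *out.
 * Returns 0 on success, -1 on failure (seek error or short read).
 * read_nth_int is a hypothetical helper name. */
int read_nth_int(FILE *fp, long i, int *out)
{
    /* Jump straight to byte offset i * sizeof(int);
     * none of the values before it are read. */
    if (fseek(fp, i * (long)sizeof(int), SEEK_SET) != 0)
        return -1;

    return fread(out, sizeof *out, 1, fp) == 1 ? 0 : -1;
}
```

With a file that holds 10, 20, 30, 40, 50, a call such as read_nth_int(fp, 3, &x) stores 40 in x, and asking for index 5 fails because there is no sixth value to read.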
Is there any real advantage of using fseek for positioning the file pointer to the i-th int value and reading it after the fseek instead of using a sequence of i fread calls?
There are potential benefits at every level.
By seeking, the system may have to read less from the disk. The system reads from the disk in sectors, so short seeks might not have this benefit because of caching. But seeking over entire sectors reduces the amount of data that needs to be fetched from the disk.
Similarly, by seeking, the stdio library may have to request less from the OS.
The stdio library normally reads more than it requires so that future calls to fread don't need to touch the OS or the disk. A short seek might not require making any system calls, but seeking beyond the end of the buffered data could reduce the total amount of data fetched from the OS.
Finally, the skipped data doesn't need to be copied from the stdio library's buffers to the user's buffer at all when using fseek, no matter how far you seek.
Oh, and let's not forget that you were considering i-1 reads instead of just a large one. Each of those reads consumes CPU, both in the library (error checking) and in the caller (error handling).
Is there any real advantage of using fseek for positioning the file pointer to the i-th int value and reading it after the fseek instead of using a sequence of i fread calls?
Yes, if you want to read a value from the file and you know where it is, there is no reason to read anything else.
Intuitively, I think that fseek is faster. But how does the function find the i-th value in the file without reading the intermediate information?
Your intuition is correct: if you read one value, it stands to reason that it will be more efficient than reading several values. The way it finds the value is simple. Generally speaking, each position in the file corresponds to 1 byte, so if you pass an offset of, for example, 7, the next read will start from the 8th byte. Imagine your file has the following data:
-58 10 12 14 7 9
^      ^
|      |
0      offset of 7
fseek(fp, 7, SEEK_SET);
if (fscanf(fp, "%d", &num) == 1) {
    printf("%d", num);
}
Will output 12.
The file position indicator was set to offset 7, so reading begins at the 8th byte of the file. It's as if you had an array and wanted the element at index 7: you'd just use arr[7].
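The snippet above can be wrapped into a small self-contained function. This is only a sketch: it assumes a stream on which byte offsets are meaningful (a binary stream, or any stream on a POSIX system), and the name int_at_offset is hypothetical.

```c
#include <stdio.h>

/* Seeks to byte offset `offset` and parses the decimal integer that
 * starts there (leading whitespace is skipped by "%d").
 * Returns 0 on success, -1 on failure.
 * int_at_offset is a hypothetical helper name. */
int int_at_offset(FILE *fp, long offset, int *num)
{
    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;

    return fscanf(fp, "%d", num) == 1 ? 0 : -1;
}
```

On a file containing "-58 10 12 14 7 9", offset 7 yields 12 and offset 0 yields -58. Offset 8 lands in the middle of the number and yields 2, which shows that fseek knows nothing about the values, only about bytes.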
I think that this is implementation-dependent.
Though there are some small details that can be implementation defined, the overall behavior is standardised.
§7.21.9.2 The fseek function
Synopsis
1.
#include <stdio.h>
int fseek(FILE *stream, long int offset, int whence);
Description:
The fseek function sets the file position indicator for the stream pointed to by stream. If a read or write error occurs, the error indicator for the stream is set and fseek fails.
For a binary stream, the new position, measured in characters from the beginning of the file, is obtained by adding offset to the position specified by whence. The specified position is the beginning of the file if whence is SEEK_SET, the current value of the file position indicator if SEEK_CUR, or end-of-file if SEEK_END. A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.
For a text stream, either offset shall be zero, or offset shall be a value returned by an earlier successful call to the ftell function on a stream associated with the same file and whence shall be SEEK_SET.
After determining the new position, a successful call to the fseek function undoes any effects of the ungetc function on the stream, clears the end-of-file indicator for the stream, and then establishes the new position. After a successful fseek call, the next operation on an update stream may be either input or output.
Returns:
The fseek function returns nonzero only for a request that cannot be satisfied.
Let's say we have a file "x" containing the string "0123456789".
We open the file and have a file descriptor fd.
We can do read(fd, some_buffer, 5) to read 5 values into the buffer from the file.
Similarly, we can use fseek to move the pointer to the individual entries in the file.
My question is, what is the behavior of fseek when we used SEEK_END with a positive offset? Is this behavior undefined, or does it wrap around to the front of the contents of the file?
So if we did fseek(fd, 5, SEEK_END), where would the pointer be pointing to now?
My question is, what is the behavior of fseek when we used SEEK_END
with a positive offset? Is this behavior undefined, or does it wrap
around to the front of the contents of the file?
If the stream is a text stream then as far as the C language is concerned, the behavior is undefined, because the standard specifies that:
For a text stream, either offset shall be zero, or offset shall be
a value returned by an earlier successful call to the ftell function
on a stream associated with the same file and whence shall be
SEEK_SET.
(C2011, 7.21.9.2/4). No behavior is defined for the combination of a nonzero offset and SEEK_END.
For a binary stream,
the new position, measured in characters from the beginning of the
file, is obtained by adding offset to the position specified by
whence
(C2011, 7.21.9.2/3), so no, it absolutely does not wrap around. The standard goes on to say that
A binary stream need not meaningfully support fseek calls with a
whence value of SEEK_END
, so such a call as you describe may (definedly) just fail, returning an error code. If it does succeed, however -- and with some implementations it can be expected to do so for some streams -- then it results in the file position being past the end of the file. Attempts to read at such a position should have the same result as if the position was at EOF. Attempts to write have behavior that depends on the open mode of the file (all writes to streams opened in append mode go to the current end of the file) and on the implementation.
On a POSIX system, for example, the system's C implementation is specified to allow positioning streams associated with regular files past the end of the file, and successfully writing at such a position has behavior as if bytes with value 0 were written into all positions between that and the previous end of the file. Furthermore, POSIX does not make any distinction in practice between text and binary streams.
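The POSIX behavior described above can be sketched as follows. This assumes a POSIX system and a stream associated with a regular file; ISO C alone does not promise that the seek succeeds. The name write_past_end is hypothetical.

```c
#include <stdio.h>

/* Seeks `gap` bytes past the current end of the file, writes one byte
 * there, and returns the resulting file size, or -1 on failure.
 * Assumes POSIX semantics for a regular file.
 * write_past_end is a hypothetical helper name. */
long write_past_end(FILE *fp, long gap, char byte)
{
    if (fseek(fp, gap, SEEK_END) != 0)   /* position is now beyond EOF */
        return -1;
    if (fputc(byte, fp) == EOF)          /* the write extends the file */
        return -1;
    if (fseek(fp, 0, SEEK_END) != 0)     /* also flushes the pending write */
        return -1;
    return ftell(fp);
}
```

Starting from a 10-byte file "0123456789", write_past_end(fp, 5, 'X') reports a size of 16 on such a system: the ten original bytes, five bytes that read back as 0, and the 'X'. Nothing wraps around to the front of the file.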
Why not read the documentation?
POSIX allows seeking beyond the existing end of file. If an output is
performed after this seek, any read from the gap will return zero
bytes. Where supported by the filesystem, this creates a sparse file.
fseek(f, 0, SEEK_END);
size = ftell(f);
If ftell(f) tells us the current file position, the size here should be the offset from the end of the file to the beginning. Why is the size not ftell(f)+1? Should not ftell(f) only give us the position of the end of the file?
File positions are like the cursor in a text entry widget: they are in between the bytes of the file. This is maybe easiest to understand if I draw a picture:
This is a hypothetical file. It contains four characters: a, b, c, and d. Each character gets a little box to itself, which we call a "byte". (This file is ASCII.) The fifth box has been crossed out because it's not part of the file yet, but if you appended a fifth character to the file it would spring into existence.
The valid file positions in this file are 0, 1, 2, 3, and 4. There are five of them, not four; they correspond to the vertical lines before, after, and in between the boxes. When you open the file (assuming you don't use "a"), you start out on position 0, the line before the first byte in the file. When you seek to the end, you arrive at position 4, the line after the last byte in the file. Because we start counting from zero, this is also the number of bytes in the file. (This is one of the several reasons why we start counting from zero, rather than one.)
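To make the counting concrete, here is a minimal sketch that reports the position reached by seeking to the end. It assumes a binary stream on an implementation where that position is a byte offset (POSIX, or a binary stream on Windows). The name end_position is hypothetical.

```c
#include <stdio.h>

/* Seeks to the end of the stream and returns the position reported
 * there, or -1 on failure. On the implementations assumed here, that
 * position equals the number of bytes in the file.
 * end_position is a hypothetical helper name. */
long end_position(FILE *fp)
{
    if (fseek(fp, 0, SEEK_END) != 0)
        return -1;
    return ftell(fp);   /* the "line" after the last byte */
}
```

For the four-byte file "abcd" this returns 4, not 3 and not 5, and for an empty file it returns 0, where the first and last positions coincide.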
I am obliged to warn you that there are several reasons why
fseek(fp, 0, SEEK_END);
long int nbytes = ftell(fp);
might not give you the number you actually want, depending on what you mean by "file size" and on the contents of the file. In no particular order:
On Windows, if you open a file in text mode, the numbers you get from ftell on that file are not byte offsets from the beginning of the file; they are more like fgetpos cookies, that can only be used in a subsequent call to fseek. If you need to seek around in a text file on Windows you may be better off opening the file in binary mode and dealing with both DOS and Unix line endings yourself — this is actually my recommendation for production code in general, because it's perfectly possible to have a file with DOS line endings on a Unix system, or vice versa.
On systems where long int is 32 bits, files can easily be bigger than that, in which case ftell will fail, return −1 and set errno to EOVERFLOW. POSIX.1-2001-compliant systems provide a function called ftello that returns an off_t quantity that can represent larger file sizes, provided you put #define _FILE_OFFSET_BITS 64 at the very top of all your source files (before any #includes). I don't know what the Windows equivalent is.
If your file contains characters that are beyond ASCII, then the number of bytes in the file is very likely to be different from the number of characters in the file. (For instance, if the file is encoded in UTF-8, the character 啡 will take up three bytes, Ä will take up either two or three bytes depending on whether it's "composed", and జ్ఞా will take up twelve bytes because, despite being a single grapheme, it's a string of four Unicode code points.) ftell(fp) will still tell you the correct number to pass to malloc, if your goal is to read the entire file into memory, but iterating over "characters" will not be so simple as for (i = 0; i < len; i++).
If you are using C's "wide streams" and "wide characters", then, just like text streams on Windows, the numbers you get from ftell on that file are not byte offsets and may not be useful for anything other than subsequent calls to fseek. But wide streams and characters are a bad design anyway; you're actually more likely to be able to handle all the world's languages correctly if you stick to processing UTF-8 by hand in narrow streams and characters.
I'm not sure why fseek()/ftell() is taught as a generic way to get the size of a file. It only works because an implementation defines it to work. POSIX does, for one. Windows does, also, for binary streams - but not for text streams.
It's wrong to not add a caveat or warning to, "This is how you get the number of bytes in a file." Because when a programmer first gets on a system that doesn't define fseek()/ftell() as byte offsets, they're going to have problems. I've seen it.
"But I was told this is how you can always do it."
"Well, no. Whoever taught you was wrong."
Because it is impossible to use fseek()/ftell() to get the size of a file in strictly-conforming C code.
For a binary stream, 7.21.9.2 The fseek function, paragraph 3 of the C standard:
For a binary stream, the new position, measured in characters from the
beginning of the file, is obtained by adding offset to the
position specified by whence. The specified position is the
beginning of the file if whence is SEEK_SET, the current value of
the file position indicator if SEEK_CUR , or end-of-file if
SEEK_END. A binary stream need not meaningfully support fseek
calls with a whence value of SEEK_END.
Footnote 268 specifically states:
Setting the file position indicator to end-of-file, as with
fseek(file, 0, SEEK_END), has undefined behavior for a binary
stream (because of possible trailing null characters) or for any
stream with state-dependent encoding that does not assuredly end in
the initial shift state.
So you can't seek to the end of a binary stream to get a file's size in bytes.
And for a text stream, 7.21.9.4 The ftell function, paragraph 2 states:
The ftell function obtains the current value of the file position
indicator for the stream pointed to by stream. For a binary
stream, the value is the number of characters from the
beginning of the file. For a text stream, its file position
indicator contains unspecified information, usable by the fseek
function for returning the file position indicator for the stream to
its position at the time of the ftell call; the difference
between two such return values is not necessarily a meaningful
measure of the number of characters written or read.
So you can't use ftell() on a text stream to get a byte count.
The only strictly-conformant approach that I'm aware of to get the number of bytes in a file is to read them one-by-one with fgetc() and count them.
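A minimal sketch of that counting approach, assuming a binary stream opened for reading. The name count_bytes is hypothetical. Note that it reads the whole file, so it is slow for large files.

```c
#include <limits.h>
#include <stdio.h>

/* Counts the bytes in a binary stream by reading it from the start.
 * Returns the count, or -1 on a read error or if the count would not
 * fit in a long. count_bytes is a hypothetical helper name. */
long count_bytes(FILE *fp)
{
    long n = 0;

    rewind(fp);                      /* start from the beginning */
    while (fgetc(fp) != EOF) {
        if (n == LONG_MAX)           /* refuse to overflow the counter */
            return -1;
        n++;
    }
    return ferror(fp) ? -1 : n;      /* EOF may also signal an error */
}
```

The function relies only on rewind, fgetc, and ferror, and makes no assumption about what SEEK_END or ftell mean on the platform.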
While I understand that fpos_t is an opaque type intended to be initialized by the fgetpos() function, §7.19.9.1 of the C99 rationale states that:
fgetpos and fsetpos were added to C89 to allow random access operations on files that are too large to handle with fseek and ftell.
and §7.19.9.2:
The need to encode both record position and position within a record in a long value may constrain the size of text files upon which fseek and ftell can be used to be considerably smaller than the size of binary files.
...
fgetpos and fsetpos were added to deal with files that are too large to handle with fseek and ftell.
This seems to primarily focus on text files (files opened with a mode excluding the b flag), because some implementations may require storing two positions (a file record position and a record character position), which could significantly reduce the effective range of the fseek() and ftell() functions for text streams.
Nevertheless, I'm clueless as to how this is particularly useful for text streams, and I certainly don't understand how it could effectively be used for "random access."
It seems the only way to actually utilize these functions is by reading every character of a file and caching their fgetpos()d fpos_t values, which seems niche at best, since you almost certainly don't want to read anywhere near LONG_MAX characters.
What was "the Committee" thinking? Is there a C99 rationale rationale?
I believe that on some (probably archaic mainframe) systems text files are stored as a series of "records" (lines), and the file position is therefore made up of both a record index and a position within the record, which is what the rationale text seems to be referring to. At the operating system level, the seek operation requires both a record index and a position within the record, rather than a byte position within the file; this leads to the problem that both the record index and the position within the record must be encoded within a long value for use with fseek and ftell. Therefore, a library implementation needs to assign some number of bits to each of record index and position, and this limits the number of records and the position.
For example, if long has 32 bits, then this might be divided into 25 bits for the record index and 7 bits for the position within the record (allowing a maximum usable record length of 127, and 2^25 ~= 33.5 million records). The system may however allow more and larger records than this.
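The 25/7 split can be sketched as plain bit packing. This is a hypothetical layout used only to illustrate the arithmetic, not the encoding of any real C library. It uses uint32_t rather than long so that the shifts are well defined.

```c
#include <stdint.h>

/* Hypothetical layout from the example: the low 7 bits hold the position
 * within the record, the high 25 bits hold the record index. */
#define POS_BITS    7u
#define RECORD_BITS 25u
#define POS_MASK    ((UINT32_C(1) << POS_BITS) - 1u)   /* 127 */
#define MAX_RECORDS (UINT32_C(1) << RECORD_BITS)       /* 33,554,432 */

/* Packs a record index and an in-record position into one 32-bit value. */
uint32_t encode_position(uint32_t record, uint32_t pos)
{
    return (record << POS_BITS) | (pos & POS_MASK);
}

/* Recovers the record index from a packed value. */
uint32_t decoded_record(uint32_t encoded)
{
    return encoded >> POS_BITS;
}

/* Recovers the position within the record from a packed value. */
uint32_t decoded_pos(uint32_t encoded)
{
    return encoded & POS_MASK;
}
```

Record 3, position 5 packs to 3 * 128 + 5 = 389. Any record index of MAX_RECORDS or more, or any position above 127, simply cannot be represented, which is the limitation the rationale describes.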
(Above statements are partly vague recollection, and partly inference from the rationale text).
However, the real problem with fseek and ftell on even modern desktop systems is that a long value may not be enough to represent the full range of file positions. On 32-bit systems long is usually 32 bits, but files can often still grow to be larger than 2GB. Therefore a different mechanism for specifying file offsets is required.
I certainly don't understand how it could effectively be used for "random access."
In this case, by "random access" they mean the ability to seek to any point that has already been visited; that is, you can reposition (using fsetpos) to any position that you have already obtained (via fgetpos). It is not about seeking to any arbitrary byte position. Arguably "random access" is the wrong term, but I think they just wanted to distinguish it from purely sequential access.
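One way to picture this kind of "random access" is a one-pass index of line starts that can later be revisited in any order. The sketch below is only an illustration; the names index_lines and first_char_of_line are hypothetical.

```c
#include <stdio.h>

/* Scans the stream once from the beginning and stores the position of the
 * start of each line in starts[], up to max_lines entries.
 * Returns the number of positions stored.
 * index_lines is a hypothetical helper name. */
size_t index_lines(FILE *fp, fpos_t *starts, size_t max_lines)
{
    size_t n = 0;

    rewind(fp);
    while (n < max_lines) {
        fpos_t here;
        if (fgetpos(fp, &here) != 0)
            break;

        int c = fgetc(fp);
        if (c == EOF)                /* nothing left, so no line starts here */
            break;
        starts[n++] = here;

        while (c != '\n' && c != EOF) /* skip the rest of this line */
            c = fgetc(fp);
    }
    return n;
}

/* Jumps back to a previously recorded line start and returns its first
 * character, or EOF on failure.
 * first_char_of_line is a hypothetical helper name. */
int first_char_of_line(FILE *fp, const fpos_t *start)
{
    if (fsetpos(fp, start) != 0)
        return EOF;
    return fgetc(fp);
}
```

After indexing a file containing "one", "two", and "three" on separate lines, the three stored positions can be visited in any order, but there is no way to compute a position for a place the scan never reached.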
I understand the workings of ftell() and fseek() in C, but for this question I couldn't find any precise answer anywhere, including from the closest post on StackOverflow(LINK).
So can you please answer the following:
Can it be concluded that fgetpos() and fsetpos() are relevant only for text files opened in text mode, and not for files opened in binary mode?
What kind of position information is the fpos_t object filled with by fgetpos(), given that it is not a long integer offset etc. like the one given by ftell()? The site cplusplusreference only tells the following:
The function fills the fpos_t object pointed by pos with the information needed from the stream's position indicator to restore the stream to its current position
fgetpos() and fsetpos() are relevant for both text and binary mode.
The advantage of fgetpos() is that it keeps the full position in the stream, including its internal state, so that you can restore it later. This works whether you are in text mode or not. This is especially important if you are using wide-oriented streams or mix fgetc() and fgetwc() in the same file, because some locales use a state-dependent multibyte encoding (state depends on previous reads).
fseek() and ftell() can also work with text and binary mode. However, there is an important restriction in text mode: you should only use fseek() with 0 or a value previously returned by ftell() (in binary mode you can use whatever value you want). This is because reading in text mode can change the number of bytes returned compared to the bytes actually in the file (typical example: the 2 CR+LF bytes in a Windows file, which are converted to a single LF byte).
As ftell() only returns a long int offset, it can't keep track of the multibyte state if this is needed. So using fseek() might lose this state.
Not quite. Clues can be found from Beej:
On virtually every system (and certainly every system that I know of),
people don't use these functions, using ftell() and fseek() instead.
These functions exist just in case your system can't remember file
positions as a simple byte offset.
And Linux man pages:
On some non-UNIX systems, an fpos_t object may be a complex object and
these routines may be the only way to portably reposition a text
stream.
And on Windows:
It assumes that any \n character in the buffer was originally a \r\n
sequence that had been normalized when the data was read into the
buffer.
That is to say, files that aren't (Windows-linebreak) text files go wrong in Windows when opened in text mode because fsetpos is assuming the file really was a (Windows-linebreak) text file and therefore cannot contain a \n with no \r.
The C11 standard says (my emphasis):
7.21.2/6:
Each wide-oriented stream has an associated mbstate_t object that
stores the current parse state of the stream. A successful call to
fgetpos stores a representation of the value of this mbstate_t object
as part of the value of the fpos_t object. A later successful call to
fsetpos using the same stored fpos_t value restores the value of the
associated mbstate_t object as well as the position within the
controlled stream.
Note that fseek and ftell have nothing to say about the mbstate_t object: they do not report or restore it. So on wide-oriented streams (that is to say, streams on which you've used wide-oriented I/O functions) they only reset the file position, not (if the implementation actually has more than one possible value of a mbstate_t object) the whole state of the stream.
Wide-oriented streams aren't the same thing as text streams, it's just that reading wide text files is the common use for them. Actually fseek and ftell are documented to be able to reset the file position on text files, provided you use them correctly. So I believe (I might be wrong) that fsetpos and fgetpos are only required when using wide I/O functions on the stream.
Besides the reasons mentioned in the other answers, it might be necessary to use fgetpos and fsetpos if you are working with very large files, files that contain more than LONG_MAX bytes. This is a real concern on systems where LONG_MAX is 2^31 − 1; files with more than two billion bytes in them are not that uncommon nowadays.
If you are on a system that implements POSIX.1-2001, there is a better alternative, which is to #define _FILE_OFFSET_BITS 64 before including any system header files, and then use fseeko and ftello. These are just like fseek and ftell except that they take/return an off_t quantity, which, provided you have made the above #define, is guaranteed to be an integer type that can represent 2^63 − 1, which ought to be enough for anybody. This is better because you can do arithmetic on off_t; you can't use fpos_t to go somewhere you haven't already been. But if you're not on a POSIX system, fgetpos and fsetpos might be your only option.
(Note that some systems will give you an fpos_t that can't represent a file offset greater than LONG_MAX bytes. On some of those, applying the same #define _FILE_OFFSET_BITS 64 setting will help. On others, you're just completely out of luck if you want a huge file.)
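A minimal sketch of the fseeko/ftello approach, assuming a POSIX.1-2001 system whose C library honors _FILE_OFFSET_BITS (glibc does). The name stream_size is hypothetical.

```c
#define _FILE_OFFSET_BITS 64   /* must come before any #include */
#include <stdio.h>
#include <sys/types.h>         /* off_t */

/* Returns the size of the stream in bytes as an off_t, or -1 on failure,
 * and puts the stream back at its original position afterwards.
 * Assumes ftello reports byte offsets, as it does on POSIX.
 * stream_size is a hypothetical helper name. */
off_t stream_size(FILE *fp)
{
    off_t saved = ftello(fp);
    if (saved == (off_t)-1)
        return -1;

    if (fseeko(fp, 0, SEEK_END) != 0)
        return -1;
    off_t size = ftello(fp);

    /* off_t is an integer type, so it can be stored, compared,
     * and used in arithmetic, unlike fpos_t. */
    if (fseeko(fp, saved, SEEK_SET) != 0)
        return -1;
    return size;
}
```

Because saved is an ordinary integer, the same pattern extends to things like fseeko(fp, saved + 100, SEEK_SET), which has no fpos_t equivalent.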