Use ftell to find the file size - c

fseek(f, 0, SEEK_END);
size = ftell(f);
If ftell(f) tells us the current file position, the size here should be the offset from the end of the file to the beginning. Why is the size not ftell(f)+1? Should not ftell(f) only give us the position of the end of the file?

File positions are like the cursor in a text entry widget: they are in between the bytes of the file. This is maybe easiest to understand if I draw a picture:
This is a hypothetical file. It contains four characters: a, b, c, and d. Each character gets a little box to itself, which we call a "byte". (This file is ASCII.) The fifth box has been crossed out because it's not part of the file yet, but if you appended a fifth character to the file it would spring into existence.
The valid file positions in this file are 0, 1, 2, 3, and 4. There are five of them, not four; they correspond to the vertical lines before, after, and in between the boxes. When you open the file (assuming you don't use "a"), you start out on position 0, the line before the first byte in the file. When you seek to the end, you arrive at position 4, the line after the last byte in the file. Because we start counting from zero, this is also the number of bytes in the file. (This is one of the several reasons why we start counting from zero, rather than one.)
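To make the picture concrete, here is a minimal sketch that reports the position just past the last byte. The helper name end_position is made up for illustration, and the sketch assumes a seekable binary stream on a platform (such as POSIX) where seeking to SEEK_END is meaningful.

```c
#include <stdio.h>

/* Hypothetical helper: returns the position just past the last byte,
 * or -1L on failure. Assumes a seekable binary stream. */
long end_position(FILE *fp)
{
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1L;
    return ftell(fp);   /* ftell itself returns -1L on failure */
}
```

For the four-byte file abcd, ftell returns 0 right after opening and end_position returns 4: the boundary after the last byte, which is also the byte count.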
I am obliged to warn you that there are several reasons why
fseek(fp, 0, SEEK_END);
long int nbytes = ftell(fp);
might not give you the number you actually want, depending on what you mean by "file size" and on the contents of the file. In no particular order:
On Windows, if you open a file in text mode, the numbers you get from ftell on that file are not byte offsets from the beginning of the file; they are more like fgetpos cookies, that can only be used in a subsequent call to fseek. If you need to seek around in a text file on Windows you may be better off opening the file in binary mode and dealing with both DOS and Unix line endings yourself — this is actually my recommendation for production code in general, because it's perfectly possible to have a file with DOS line endings on a Unix system, or vice versa.
On systems where long int is 32 bits, files can easily be bigger than that, in which case ftell will fail, return −1 and set errno to EOVERFLOW. POSIX.1-2001-compliant systems provide a function called ftello that returns an off_t quantity that can represent larger file sizes, provided you put #define _FILE_OFFSET_BITS 64 at the very top of all your source files (before any #includes). I don't know what the Windows equivalent is.
If your file contains characters that are beyond ASCII, then the number of bytes in the file is very likely to be different from the number of characters in the file. (For instance, if the file is encoded in UTF-8, the character 啡 will take up three bytes, Ä will take up either two or three bytes depending on whether it's "composed", and జ్ఞా will take up twelve bytes because, despite being a single grapheme, it's a string of four Unicode code points.) ftell(fp) will still tell you the correct number to pass to malloc, if your goal is to read the entire file into memory, but iterating over "characters" will not be so simple as for (i = 0; i < len; i++).
If you are using C's "wide streams" and "wide characters", then, just like text streams on Windows, the numbers you get from ftell on that file are not byte offsets and may not be useful for anything other than subsequent calls to fseek. But wide streams and characters are a bad design anyway; you're actually more likely to be able to handle all the world's languages correctly if you stick to processing UTF-8 by hand in narrow streams and characters.
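The large-file caveat above can be sketched roughly as follows. This assumes a POSIX system and a seekable binary stream; the helper name stream_size is hypothetical, and the two feature macros must appear before any #include in a real source file.

```c
#define _FILE_OFFSET_BITS 64      /* must precede every #include */
#define _POSIX_C_SOURCE 200809L   /* exposes fseeko/ftello */
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: size of a seekable binary stream in bytes on a
 * POSIX system, or (off_t)-1 on failure. Restores the original position. */
off_t stream_size(FILE *fp)
{
    off_t saved = ftello(fp);
    if (saved == (off_t)-1)
        return (off_t)-1;
    if (fseeko(fp, 0, SEEK_END) != 0)
        return (off_t)-1;
    off_t size = ftello(fp);
    if (fseeko(fp, saved, SEEK_SET) != 0)
        return (off_t)-1;
    return size;
}
```

Because off_t is an arithmetic type, the result can be compared and printed (after a cast to a known integer type) like any other number.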

I'm not sure why fseek()/ftell() is taught as a generic way to get the size of a file. It only works because an implementation defines it to work. POSIX does, for one. Windows does, also, for binary streams - but not for text streams.
It's wrong to not add a caveat or warning to, "This is how you get the number of bytes in a file." Because when a programmer first gets on a system that doesn't define fseek()/ftell() as byte offsets, they're going to have problems. I've seen it.
"But I was told this is how you can always do it."
"Well, no. Whoever taught you was wrong."
Because it is impossible to use fseek()/ftell() to get the size of a file in strictly-conforming C code.
For a binary stream, 7.21.9.2 The fseek function, paragraph 3 of the C standard:
For a binary stream, the new position, measured in characters from the
beginning of the file, is obtained by adding offset to the
position specified by whence. The specified position is the
beginning of the file if whence is SEEK_SET, the current value of
the file position indicator if SEEK_CUR , or end-of-file if
SEEK_END. A binary stream need not meaningfully support fseek
calls with a whence value of SEEK_END.
Footnote 268 specifically states:
Setting the file position indicator to end-of-file, as with
fseek(file, 0, SEEK_END), has undefined behavior for a binary
stream (because of possible trailing null characters) or for any
stream with state-dependent encoding that does not assuredly end in
the initial shift state.
So you can't seek to the end of a binary stream to get a file's size in bytes.
And for a text stream, 7.21.9.4 The ftell function, paragraph 2 states:
The ftell function obtains the current value of the file position
indicator for the stream pointed to by stream. For a binary
stream, the value is the number of characters from the
beginning of the file. For a text stream, its file position
indicator contains unspecified information, usable by the fseek
function for returning the file position indicator for the stream to
its position at the time of the ftell call; the difference
between two such return values is not necessarily a meaningful
measure of the number of characters written or read.
So you can't use ftell() on a text stream to get a byte count.
The only strictly-conformant approach that I'm aware of to get the number of bytes in a file is to read them one-by-one with fgetc() and count them.
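A minimal sketch of that counting approach might look like this. The helper name count_bytes is made up for illustration, and the stream is assumed to be opened in binary mode ("rb"). Note that the standard still lets an implementation append padding null characters to a binary stream, and those would be counted too.

```c
#include <stdio.h>

/* Counts the bytes from the current position to end-of-file by reading
 * them. Returns 0 on success and stores the count in *count; returns -1
 * on a read error. The stream should be opened in binary mode ("rb"). */
int count_bytes(FILE *fp, unsigned long long *count)
{
    unsigned long long n = 0;
    while (fgetc(fp) != EOF)
        n++;
    if (ferror(fp))
        return -1;
    *count = n;
    return 0;
}
```

This is slow for large files, since every byte is read, but it relies only on fgetc and ferror.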

Related

What is the advantage of using fseek over using a sequence of fread in C?

I'm a beginner in C programming and I have some questions regarding how to deal with files.
Let us suppose that we have a binary file with N int values stored.
Let us suppose that we want to read the i-th int value in the file.
Is there any real advantage of using fseek for positioning the file pointer to the i-th int value and reading it after the fseek instead of using a sequence of i fread calls?
Intuitively, I think that fseek is faster. But how does the function find the i-th value in the file without reading the intermediary information?
I think that this is implementation-dependent. So, I tried to find the implementation of fseek function, without much success.
But how does the function find the i-th value in the file without reading the intermediary information?
It doesn't. It's up to you to provide the correct (absolute or relative) offset. You can request, for example, to advance the file pointer by i*sizeof(X).
It still needs to follow the chain of sectors in which the file is located to find the right one, but that doesn't require reading those sectors. That metadata is stored outside of the file itself.
Is there any real advantage of using fseek for positioning the file pointer to the i-th int value and reading it after the fseek instead of using a sequence of i fread calls?
There are potential benefits at every level.
By seeking, the system may have to read less from the disk. The system reads from the disk in sectors, so short seeks might not have this benefit because of caching. But seeking over entire sectors reduces the amount of data that needs to be fetched from the disk.
Similarly, by seeking, the stdio library may have to request less from the OS.
The stdio library normally reads more than it requires so that future calls to fread don't need to touch the OS or the disk. A short seek might not require making any system calls, but seeking beyond the end of the buffered data could reduce the total amount of data fetched from the OS.
Finally, the skipped data doesn't need to be copied from the stdio library's buffers to the user's buffer at all when using fseek, no matter how far you seek.
Oh, and let's not forget that you were considering i-1 reads instead of just a large one. Each of those reads consumes CPU, both in the library (error checking) and in the caller (error handling).
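As a rough sketch of the seek-then-read approach discussed above: the helper name read_int_at is made up, and it assumes the file was opened in binary mode and was written with the same int size and byte order as the reading program.

```c
#include <stdio.h>

/* Hypothetical helper: reads the int at zero-based index i from a binary
 * stream of ints. Returns 0 on success, -1 on failure.
 * Assumes i * sizeof(int) fits in a long. */
int read_int_at(FILE *fp, long i, int *out)
{
    if (i < 0)
        return -1;
    if (fseek(fp, i * (long)sizeof(int), SEEK_SET) != 0)
        return -1;
    return fread(out, sizeof *out, 1, fp) == 1 ? 0 : -1;
}
```

Only one fread is issued regardless of i, so none of the skipped values are copied into the caller's buffer.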
Is there any real advantage of using fseek for positioning the file pointer to the i-th int value and reading it after the fseek instead of using a sequence of i fread calls?
Yes, if you want to read a value from the file and you know where it is, there is no reason to read anything else.
Intuitively, I think that fseek is faster. But how does the function find the i-th value in the file without reading the intermediary information?
Your intuition is correct: reading one value will be more efficient than reading several. The way it finds the value is simple. Generally speaking, each position in the file corresponds to 1 byte, so if you pass an offset of, for example, 7, the next read will start from the 8th byte. Imagine your file has the following data:
-58 10 12 14 7 9
^      ^
|      |
0      offset of 7
fseek(fp, 7, SEEK_SET);
if(fscanf(fp,"%d",&num) == 1 ){
printf("%d", num);
}
Will output 12.
The file position indicator was set to offset 7, so reading begins at the 8th byte. It's as if you had an array and wanted to access index 7: you would just use arr[7].
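The snippet above can be wrapped into a small function with error checking. This is only a sketch: the name scan_int_at is hypothetical, and it assumes a binary stream, where offsets are plain byte counts.

```c
#include <stdio.h>

/* Hypothetical helper: seeks to a byte offset and parses the decimal
 * integer that starts there. Returns 0 on success, -1 on failure.
 * Assumes a binary stream, where offsets are plain byte counts. */
int scan_int_at(FILE *fp, long offset, int *out)
{
    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;
    return fscanf(fp, "%d", out) == 1 ? 0 : -1;
}
```

With the sample data, offset 7 yields 12, while offset 8 lands in the middle of that number and yields 2, because fseek counts bytes rather than values.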
I think that this is implementation-dependent.
Though there are some small details that can be implementation defined, the overall behavior is standardised.
§7.21.9.2 The fseek function
Synopsis
#include <stdio.h>
int fseek(FILE *stream, long int offset, int whence);
Description:
The fseek function sets the file position indicator for the stream pointed to by stream. If a read or write error occurs, the error indicator for the stream is set and fseek fails.
For a binary stream, the new position, measured in characters from the beginning of the file, is obtained by adding offset to the position specified by whence. The specified position is the beginning of the file if whence is SEEK_SET, the current value of the file position indicator if SEEK_CUR, or end-of-file if SEEK_END. A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.
For a text stream, either offset shall be zero, or offset shall be a value returned by an earlier successful call to the ftell function on a stream associated with the same file and whence shall be SEEK_SET.
After determining the new position, a successful call to the fseek function undoes any effects of the ungetc function on the stream, clears the end-of-file indicator for the stream, and then establishes the new position. After a successful fseek call, the next operation on an update stream may be either input or output.
Returns:
The fseek function returns nonzero only for a request that cannot be satisfied.

fseek() on text files in C language

I'm starting to learn C and I'm using text files: loading text from them, working with it, and then updating it in the same file. I was told that fseek() is not guaranteed to work every time in text files, and I can't truly understand why. If someone could explain this it would be great!
I also found that if you do
pos = ftell(file);
fseek(pos);
it's guaranteed to move the pointer to "pos". Is this right?
fseek() is used to move the file pointer associated with a given file to a specific position. Syntax: fseek(FILE *pointer, long int offset, int position), where offset is the number of bytes to offset from position, and position is the point from which the offset is added.
position defines the point with respect to which the file pointer needs to be moved. It has three values:
SEEK_END : It denotes end of the file.
SEEK_SET : It denotes starting of the file.
SEEK_CUR : It denotes file pointer’s current position.
We need to specify the position:
    // Moving pointer to end
    fseek(fp, 0, SEEK_END);
    // Printing position of pointer
    printf("%ld", ftell(fp));
What you have written is also right, but specifying the offset and position explicitly is much better.
I was told that fseek() is not guaranteed to work every time in text files.
This is true, but an explanation is required:
Some legacy systems use multiple bytes to encode the end of line in text files, or some other scheme such as fixed length records... This makes file offsets differ from the number of bytes read from the stream. In fact, some file offsets are meaningless in text files, such as the offset of the LF byte in a CR/LF sequence. This feature is also a problem when writing text files and is more so in update mode when using the same stream pointer to read and write to the same file.
This was never a problem on Unix systems where text files and binary files are just sequences of bytes and line endings represented by a single newline byte.
When porting the C language to other operating systems, compiler vendors came up with various elaborate tricks to handle the translation of system specific line endings to a single '\n' byte.
As these tricks were system specific or even vendor specific, no common approach could be standardized in 1989 when ANSI drafted the first C Standard. They just agreed on the b flag for the fopen() mode argument and removed any constraint on the meaning of the return value of ftell() beyond simple constraints:
C19 7.21.9.2 The fseek function
Synopsis
#include <stdio.h>
int fseek(FILE *stream, long int offset, int whence);
Description
The fseek function sets the file position indicator for the stream pointed to by stream. If a read or write error occurs, the error indicator for the stream is set and fseek fails.
For a binary stream, the new position, measured in characters from the beginning of the file, is obtained by adding offset to the position specified by whence. The specified position is the beginning of the file if whence is SEEK_SET, the current value of the file position indicator if SEEK_CUR, or end-of-file if SEEK_END. A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.
For a text stream, either offset shall be zero, or offset shall be a value returned by an earlier successful call to the ftell function on a stream associated with the same file and whence shall be SEEK_SET.
Note that fseek(pos); in your question is incorrect as the stream and whence argument are missing. You should write:
long pos = ftell(file);
...
fseek(file, pos, SEEK_SET); // move back to position <pos>
Note however that fseek() can fail for other reasons: not all streams support seeking, such as pipes, terminal connections and other character devices. File offsets may also exceed the range of type long, especially on legacy systems, so fgetpos() / fsetpos() is a preferred alternative if available.
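As an illustration of the fgetpos() / fsetpos() alternative, here is a minimal sketch that looks at the next line and then returns to where it started. The helper name peek_line is made up for this example.

```c
#include <stdio.h>

/* Hypothetical helper: reads the next line into buf without consuming it,
 * by saving the position with fgetpos and restoring it with fsetpos.
 * Returns 0 on success, -1 on failure or end-of-file. */
int peek_line(FILE *fp, char *buf, int size)
{
    fpos_t pos;
    if (fgetpos(fp, &pos) != 0)
        return -1;
    int ok = fgets(buf, size, fp) != NULL;
    if (fsetpos(fp, &pos) != 0)
        return -1;
    return ok ? 0 : -1;
}
```

The fpos_t value is opaque: it is only ever handed back to fsetpos, which is the one use the standard guarantees for both text and binary streams.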

FSEEK offset accepts more than what it should accept

Following the Specification:
For a text stream, either offset shall be zero, or offset shall be a value returned by an earlier successful call to the ftell function on a stream associated with the same file and whence shall be SEEK_SET.
I understand that offset must be the return value of an ftell call, or 0, and whence must be SEEK_SET (or 0). But I used some integers as offsets and different SEEK_... values, and it seemed to work well.
For example, these worked:
fseek(file, 4, SEEK_CUR);
fseek(file, -1, SEEK_END);
fseek(file, 0, SEEK_CUR);
When I read the specification it seems to me that it should not work. I tried to use fseek this way many times, and it never failed. Why does it work, what point am I not getting?
In the ftell documentation you can read
For text streams, the numerical value may not be meaningful but can
still be used to restore the position to the same position later using
fseek (if there are characters put back using ungetc still pending of
being read, the behavior is undefined).
What you cited means that it makes sense to use fseek when you know where you want to place the pointer, and you may know that because you previously called ftell().
All your calls to fseek are valid, but in a text file it does not make much sense to move around using fseek, because it is not a random-access (binary) file. Still, this does not mean that it is wrong to use it.
For a text file, you can find here the most common functions to access it, like fscanf(), fprintf() and so on.
When I read the specification it seems to me that it should not work.
The specification states what must work. It should be seen as the minimum requirements for someone creating a C library (i.e. the implementor of fseek et al).
Incorrect use might still work, but there is no guarantee. The result would depend on the platform.
For instance, the Linux manual page for fseek says:
The fseek() function sets the file position indicator for the stream pointed to by stream. The new position, measured in bytes, is obtained by adding offset bytes to the position specified by whence. If whence is set to SEEK_SET, SEEK_CUR, or SEEK_END, the offset is relative to the start of the file, the current position indicator, or end-of-file, respectively. A successful call to the fseek() function clears the end-of-file indicator for the stream and undoes any effects of the ungetc(3) function on the same stream.
As you can see, the things you tried will work in Linux for both text and binary streams. However, there may exist platforms where fseek won't work with SEEK_CUR or SEEK_END for text streams.
Note also that a stream could be associated with different things: a file, a keyboard, a socket, a terminal window, a device, etc.
All your fseek calls are valid. The number you provide as the second argument is an offset, meaning it is relative to the seek type that you provide as the third parameter.
fseek(file, 4, SEEK_CUR); // seek 4 bytes forward from current position
fseek(file, -1, SEEK_END); // seek to 1 byte before the end of the file
fseek(file, 0, SEEK_CUR); // does nothing.
But see also user Tu.ma's explanation that the seek positions are not accurate and/or can be meaningless if the file has been opened in text mode (especially under Windows because of carriage return/line feed translation).
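To see where each of those calls lands, one could pair every fseek with an ftell, as in this small sketch. The helper name seek_and_tell is hypothetical, and the reported numbers are byte offsets only for a binary stream.

```c
#include <stdio.h>

/* Hypothetical helper: performs one fseek and reports the resulting
 * position, or -1L on failure. Meaningful as a byte offset only for a
 * binary stream. */
long seek_and_tell(FILE *fp, long offset, int whence)
{
    if (fseek(fp, offset, whence) != 0)
        return -1L;
    return ftell(fp);
}
```

On a 10-byte binary file positioned at the start, calling it with (4, SEEK_CUR), then (-1, SEEK_END), then (0, SEEK_CUR) reports 4, 9 and 9 on POSIX-like systems.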
There is nothing stopping you from using fseek to go beyond the current size of the file. Doing so allows you to write data at that point, filling the as yet unwritten gap with NULs. For example, this code creates a file with 1000 NULs followed by "hello\n":
#include <stdio.h>
int main(void)
{
    FILE *f;
    f = fopen("test", "w");
    if (f)
    {
        fseek(f, 1000, SEEK_SET);
        fprintf(f, "hello\n");
        fclose(f);
    }
    else
    {
        perror("fopen");
    }
}
I think the main reason why the definition of fseek is as it is in the C standard is that your logical position in a text file may not correlate to the physical number of bytes from the start of the text file.
For example, in Windows implementations, it is not uncommon to convert \r\n in the file on disk to just \n to maintain compatibility with Unix line endings. So if your file looks like this:
hello\r\nworld
i.e. two lines, and you fseek to position 6, would you expect to be on the \n or the w? If you tried to find out by using fgetc on Windows to count the characters, you'd assume you would be on the w. But fseek might simply advance to byte 6 without scanning for line endings.
Edit
And if we use the fgetc function, each character that we read increases our position by 1: the file cursor goes to the next character after the previous one was read. Is that a problem?
Yes. The problem is in the definition of "character". If you are in an environment that uses DOS conventions, using fgetc on a text stream when the next two bytes are 0x0d 0x0a advances the file position by two but only returns the 0x0a. There may be other conversions that the implementation chooses to make, like turning decomposed Unicode into precomposed Unicode or vice versa.
The wording in the C standard allows implementations to lose the one to one mapping between bytes in the file and characters returned by fgetc without having to overcomplicate fseek.
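In binary mode the ambiguity disappears, because no translation happens and every offset names exactly one stored byte. A minimal sketch, with a made-up helper name byte_at and assuming the stream was opened with "rb" (or created with tmpfile):

```c
#include <stdio.h>

/* Hypothetical helper: returns the byte stored at a given offset of a
 * binary stream, or EOF on failure or if the offset is at or past the end. */
int byte_at(FILE *fp, long offset)
{
    if (fseek(fp, offset, SEEK_SET) != 0)
        return EOF;
    return fgetc(fp);
}
```

For a file containing hello\r\nworld, offset 5 holds \r, offset 6 holds \n, and offset 7 holds w.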

Possible to read a whole file by fseek()ing to SEEK_END and obtaining the file size by ftell()?

Am I right that this code introduces undefined behavior?
#include <stdio.h>
#include <stdlib.h>
FILE *f = fopen("textfile.txt", "rb");
fseek(f, 0, SEEK_END);
long fsize = ftell(f);
fseek(f, 0, SEEK_SET); //same as rewind(f);
char *string = malloc(fsize + 1);
fread(string, fsize, 1, f);
fclose(f);
string[fsize] = 0;
The reason I'm asking is that this code is posted as an accepted and highly-upvoted answer to the following question: C Programming: How to read the whole file contents into a buffer
However, according to the following article: How to read an entire file into memory in C++ (which, despite its title, also deals with C, so stick with me):
Suppose you were writing C, and you had a FILE* (that you know points
to a file stream, or at least a seekable stream), and you wanted to
determine how many characters to allocate in a buffer to store the
entire contents of the stream. Your first instinct would probably be
to write code like this:
// Bad code; undefined behaviour
fseek(p_file, 0, SEEK_END);
long file_size = ftell(p_file);
Seems legit. But then you start getting weirdness. Sometimes the
reported size is bigger than the actual file size on disk. Sometimes
it’s the same as the actual file size, but the number of characters
you read in is different. What the hell is going on?
There are two answers, because it depends on whether the file has been
opened in text mode or binary mode.
Just in case you don't know the difference: in the default mode – text
mode – on certain platforms, certain characters get translated in
various ways during reading. The most well-known is that on Windows,
newlines get translated to \r\n when written to a file, and
translated the other way when read. In other words, if the file
contains Hello\r\nWorld, it will be read as Hello\nWorld; the file
size is 12 characters, the string size is 11. Less well-known is that
0x1A (or Ctrl-Z) is interpreted as the end of the file, so if the file
contains Hello\x1AWorld, it will be read as Hello. Also, if the
string in memory is Hello\x1AWorld and you write it to a file in
text mode, the file will be Hello. In binary mode, no
translations are done – whatever is in the file gets read in to your
program, and vice versa.
Immediately you can guess that text mode is going to be a headache –
on Windows, at least. More generally, according to the C standard:
The ftell function obtains the current value of the file position indicator for the stream pointed to by stream. For a binary stream,
the value is the number of characters from the beginning of the file.
For a text stream, its file position indicator contains unspecified
information, usable by the fseek function for returning the file
position indicator for the stream to its position at the time of the
ftell call; the difference between two such return values is not
necessarily a meaningful measure of the number of characters written
or read.
In other words, when you’re dealing with a file opened in text mode,
the value that ftell() returns is useless… except in calls to fseek().
In particular, it doesn’t necessarily tell you how many characters are
in the stream up to the current point.
So you can’t use the return value from ftell() to tell you the size of
the file, the number of characters in the file, or for anything
(except in a later call to fseek()). So you can’t get the file size
that way.
Okay, so to hell with text mode. What say we work in binary mode only?
As the C standard says: "For a binary stream, the value is the number
of characters from the beginning of the file." That sounds promising.
And, indeed, it is. If you are at the end of the file, and you call
ftell(), you will find the number of bytes in the file. Huzzah!
Success! All we need to do now is get to the end of the file. And to
do that, all you need to do is fseek() with SEEK_END, right?
Wrong.
Once again, from the C standard:
Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream
(because of possible trailing null characters) or for any stream with
state-dependent encoding that does not assuredly end in the initial
shift state.
To understand why this is the case: Some platforms store files as
fixed-size records. If the file is shorter than the record size, the
rest of the block is padded. When you seek to the “end”, for
efficiency’s sake it just jumps you right to the end of the last
block… possibly long after the actual end of the data, after a bunch
of padding.
So, here’s the situation in C:
You can’t get the number of characters with ftell() in text mode.
You can get the number of characters with ftell() in binary mode… but you can’t seek to the end of the file with fseek(p_file, 0,
SEEK_END).
I don't have enough knowledge to judge who's right here, or whether the aforementioned accepted answer indeed clashes with this article, so I'm asking this question.
What the author of the article is maliciously omitting is the context of the quote.
From the C11 draft standard n1570, NON-NORMATIVE FOOTNOTE 268:
Setting the file position indicator to end-of-file, as with
fseek(file, 0, SEEK_END), has undefined behavior for a binary stream
(because of possible trailing null characters) or for any stream with
state-dependent encoding that does not assuredly end in the initial
shift state.
The normative part of the standard that refers to the footnote is this 7.21.3 Files:
9 Although both text and binary wide-oriented streams are conceptually
sequences of wide characters, the external file associated with a
wide-oriented stream is a sequence of multibyte characters,
generalized as follows:
— Multibyte encodings within files may contain
embedded null bytes (unlike multibyte encodings valid for use internal
to the program).
— A file need not begin nor end in the initial shift state. 268)
Note that this concerns wide-oriented streams.
Now, in 7.21.9.2 The fseek function
3 For a binary stream, the new position, measured in characters from
the beginning of the file, is obtained by adding offset to the
position specified by whence. The specified position is the beginning
of the file if whence is SEEK_SET, the current value of the file
position indicator if SEEK_CUR, or end-of-file if SEEK_END. A binary
stream need not meaningfully support fseek calls with a whence value
of SEEK_END.
The language of the final sentence is considerably less dire:
"A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END."
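One way to sidestep the question entirely is to avoid asking for the size up front and instead grow a buffer while reading. The following is only a sketch of that idea, not the code from either the answer or the article; the helper name read_all and the initial capacity of 4096 are arbitrary choices.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads from the current position to end-of-file
 * into a malloc'd, NUL-terminated buffer. Stores the byte count in *len.
 * Returns NULL on read or allocation failure. Caller frees the result. */
char *read_all(FILE *fp, size_t *len)
{
    size_t cap = 4096, used = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        /* Keep one byte in reserve for the terminating NUL. */
        used += fread(buf + used, 1, cap - used - 1, fp);
        if (used < cap - 1)
            break;                    /* short read: end-of-file or error */
        if (cap > (size_t)-1 / 2) {   /* doubling would overflow */
            free(buf);
            return NULL;
        }
        char *tmp = realloc(buf, cap * 2);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[used] = '\0';
    *len = used;
    return buf;
}
```

This uses only fread, ferror, malloc and realloc, so it also works on streams that cannot seek at all, such as pipes.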

ftello/fseeko vs fgetpos/fsetpos

What is the difference between ftello/fseeko and fgetpos/fsetpos? Both seem to be file pointer getting/setting functions that use opaque offset types to sometimes allow 64 bit offsets.
Are they supported on different platforms or by different standards? Is one more flexible in the type of the offset it uses?
And, by the way, I am aware of the difference between fgetpos/fsetpos and ftell/fseek, but this is not a duplicate. That question asks about ftell/fseek, and the answer is not applicable to ftello/fseeko.
See Portable Positioning for detailed information on the difference. An excerpt:
On some systems where text streams truly differ from binary streams, it is impossible to represent the file position of a text stream as a count of characters from the beginning of the file. For example, the file position on some systems must encode both a record offset within the file, and a character offset within the record.
As a consequence, if you want your programs to be portable to these systems, you must observe certain rules:
The value returned from ftell on a text stream has no predictable relationship to the number of characters you have read so far. The only thing you can rely on is that you can use it subsequently as the offset argument to fseek or fseeko to move back to the same file position.
In a call to fseek or fseeko on a text stream, either the offset must be zero, or whence must be SEEK_SET and the offset must be the result of an earlier call to ftell on the same stream.
The value of the file position indicator of a text stream is undefined while there are characters that have been pushed back with ungetc that haven't been read or discarded. See Unreading.
In a nutshell: fgetpos/fsetpos use a more flexible structure to store additional metadata about the file position state, enabling greater portability (in theory).
