How does fread really work? - c

The declaration of fread is as follows:
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
The question is: Is there a difference in reading performance of two such calls to fread:
char a[1000];
fread(a, 1, 1000, stdin);
fread(a, 1000, 1, stdin);
Will it read 1000 bytes at once each time?

There may or may not be any difference in performance. There is a difference in semantics.
fread(a, 1, 1000, stdin);
attempts to read 1000 data elements, each of which is 1 byte long.
fread(a, 1000, 1, stdin);
attempts to read 1 data element which is 1000 bytes long.
They're different because fread() returns the number of data elements it was able to read, not the number of bytes. If it reaches end-of-file (or an error condition) before reading the full 1000 bytes, the first version has to indicate exactly how many bytes it read; the second just fails and returns 0.
In practice, it's probably just going to call a lower-level function that attempts to read 1000 bytes and indicates how many bytes it actually read. For larger reads, it might make multiple lower-level calls. The computation of the value to be returned by fread() is different, but the expense of the calculation is trivial.
There may be a difference if the implementation can tell, before attempting to read the data, that there isn't enough data to read. For example, if you're reading from a 900-byte file, the first version will read all 900 bytes and return 900, while the second might not bother to read anything. In both cases, the file position indicator is advanced by the number of characters successfully read, i.e., 900.
But in general, you should probably choose how to call it based on what information you need from it. Read a single data element if a partial read is no better than not reading anything at all. Read in smaller chunks if partial reads are useful.
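The difference in return values can be checked directly. The following is a minimal sketch, not a definitive test: the helper names make_temp_file and items_reported are made up for illustration, and it assumes tmpfile() is usable on the system.

```c
#include <stdio.h>

/* Hypothetical helper: create a temporary file holding `len` bytes of 'x',
 * positioned at the start. Returns NULL on failure. */
static FILE *make_temp_file(size_t len)
{
    FILE *fp = tmpfile();
    if (fp == NULL)
        return NULL;
    for (size_t i = 0; i < len; ++i) {
        if (fputc('x', fp) == EOF) {
            fclose(fp);
            return NULL;
        }
    }
    rewind(fp);
    return fp;
}

/* Hypothetical helper: returns what fread reports when asked for `nmemb`
 * items of `size` bytes from a file holding `file_len` bytes,
 * or (size_t)-1 if the setup fails. */
static size_t items_reported(size_t file_len, size_t size, size_t nmemb)
{
    char buf[1000];
    if (size * nmemb > sizeof buf)
        return (size_t)-1;

    FILE *fp = make_temp_file(file_len);
    if (fp == NULL)
        return (size_t)-1;

    size_t n = fread(buf, size, nmemb, fp);
    fclose(fp);
    return n;
}
```

With a 900-byte file, items_reported(900, 1, 1000) should report 900 items, while items_reported(900, 1000, 1) should report 0 because no complete 1000-byte element was available.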

According to the specification, the two may be treated differently by the implementation.
If your file is less than 1000 bytes, fread(a, 1, 1000, stdin) (read 1000 elements of 1 byte each) will still copy all the bytes until EOF. On the other hand, the result of fread(a, 1000, 1, stdin) (read 1 1000-byte element) stored in a is unspecified, because there is not enough data to finish reading the 'first' (and only) 1000 byte element.
Of course, some implementations may still copy the 'partial' element into as many bytes as needed.

That would be implementation detail. In glibc, the two are identical in performance, as it's implemented basically as (Ref http://sourceware.org/git/?p=glibc.git;a=blob;f=libio/iofread.c):
size_t fread (void* buf, size_t size, size_t count, FILE* f)
{
size_t bytes_requested = size * count;
size_t bytes_read = read(f->fd, buf, bytes_requested);
return bytes_read / size;
}
Note that neither the C nor the POSIX standard guarantees that a complete object of size size is read every time. If a complete object cannot be read (e.g. stdin has only 999 bytes but you've requested size == 1000), the value of the partially read element is indeterminate (C99 §7.19.8.1/2).
Edit: See the other answers about POSIX.

fread calls getc internally. In Minix, the number of times getc is called is simply size*nmemb, so it depends only on the product of the two. Both fread(a, 1, 1000, stdin) and fread(a, 1000, 1, stdin) will therefore call getc 1000 (= 1000*1) times.
Here is the simple implementation of fread from Minix:
size_t fread(void *ptr, size_t size, size_t nmemb, register FILE *stream){
register char *cp = ptr;
register int c;
size_t ndone = 0;
register size_t s;
if (size)
while ( ndone < nmemb ) {
s = size;
do {
if ((c = getc(stream)) != EOF)
*cp++ = c;
else
return ndone;
} while (--s);
ndone++;
}
return ndone;
}

There may be no performance difference, but those calls are not the same.
fread returns the number of elements read, so those calls will return different values.
If an element cannot be completely read, its value is indeterminate:
If an error occurs, the resulting value of the file position indicator for the stream is
indeterminate. If a partial element is read, its value is indeterminate. (ISO/IEC 9899:TC2 7.19.8.1)
There's not much difference in the glibc implementation, which just multiplies the element size by the number of elements to determine how many bytes to read and divides the amount read by the member size in the end. But the version specifying an element size of 1 will always tell you the correct number of bytes read. However, if you only care about completely read elements of a certain size, using the other form saves you from doing a division.

One more sentence from http://pubs.opengroup.org/onlinepubs/000095399/functions/fread.html is notable:
The fread() function shall read into the array pointed to by ptr up to nitems elements whose size is specified by size in bytes, from the stream pointed to by stream. For each object, size calls shall be made to the fgetc() function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object.
In short, in both cases the data is read as if by calls to fgetc().

I wanted to clarify the answers here. fread performs buffered IO. The actual read block sizes fread uses are determined by the C implementation being used.
All modern C libraries will have the same performance with the two calls:
fread(a, 1, 1000, file);
fread(a, 1000, 1, file);
Even something like:
for (int i=0; i<1000; i++)
a[i] = fgetc(file);
should result in the same disk access patterns, although fgetc would be slower because of the extra calls into the standard C library and, in some cases, because the disk may need to perform additional seeks that would otherwise have been optimized away.
Getting back to the difference between the two forms of fread. The former returns the actual number of bytes read. The latter returns 0 if the file size is less than 1000, otherwise it returns 1. In both cases the buffer would be filled with the same data, i.e. the contents of the file up to 1000 bytes.
In general, you probably want to keep the 2nd parameter (size) set to 1 such that you get the number of bytes read.

Related

Why does fread() give seemingly random data?

I have the following C code that opens a file in rb+ mode, then writes 100 bytes of value 0. When I read the file with an offset of anything other than 0, I get 96. Why is this?
FILE *fp = fopen("myfile", "rb+");
rewind(fp);
char zero = 0;
fwrite(&zero, 1, 100, fp);
char result;
fseek(fp, 1, SEEK_SET);
fread(&result, 1, 1, fp);
printf("%d\n", result);
I'm on Linux x64 using GCC.
From your clarification in the comments, your intent was to write 100 zero bytes to the file. There are at least two and possibly three ways to do this.
The first is to allocate an array of 100 zero-initialized bytes, and write that:
char zeroes[100] = { 0 };
fwrite(zeroes, sizeof(char) /* == 1 */, sizeof(zeroes), f);
This doesn't scale well if you want to write, say, 10,000 or 10,000,000 zero bytes. You could also do this:
char zero = 0;
for (int i = 0; i < 100; ++i) fwrite(&zero, sizeof(char), 1, f);
This scales better, but performs very badly since it's always more efficient to do a single large write than many tiny writes. Instead, you can seek to a later position in the file, and then write only the last byte. On POSIX systems, this is guaranteed to fill the earlier unwritten portion of the file with zeroes:
char zero = 0;
fseek(f, 99, SEEK_SET);
fwrite(&zero, sizeof(char) /* == 1 */, 1, f);
I believe the zero-fill guarantee is also given for Windows' MSVCRT runtime, but I can't immediately find proof of that on MSDN (this might make a good question). If someone knows whether Windows, other platforms, and/or some version(s) of the C standard itself make or do not make this guarantee, this answer could be improved.
Of course, if you are on a POSIX system and don't need portable code, you can use ftruncate() which makes the same guarantee without even needing to do an fwrite(). Windows has SetEndOfFile() but that function fills the extended portion of the file with undefined values, not zero bytes.
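A middle ground between one huge array and many one-byte writes is to reuse a single zero-filled block. This is only a sketch: the function name write_zeros and the 4096-byte block size are arbitrary choices for illustration, not anything required by the standard.

```c
#include <stdio.h>

/* Writes `count` zero bytes to `fp` at its current position, reusing one
 * zero-filled block. Returns the number of bytes actually written. */
static size_t write_zeros(FILE *fp, size_t count)
{
    static const char zeros[4096] = { 0 };
    size_t written = 0;

    while (written < count) {
        size_t want = count - written;
        if (want > sizeof zeros)
            want = sizeof zeros;

        size_t got = fwrite(zeros, 1, want, fp);
        written += got;
        if (got < want)   /* write error: stop and report the short count */
            break;
    }
    return written;
}
```

Because the element size is 1, the return value is a byte count, so a caller can compare it against the requested count to detect a short write.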
You probably want something like:
int i; for (i=0;i<100;++i){fwrite(&zero, 1, 1, fp);}
You cannot write 100 bytes from a pointer that points to a single char.

C - How to handle last part of file if buffer is bigger?

Isn't it possible to read the bytes left in a file when fewer remain than the buffer size?
char * buffer = (char *)malloc(size);
FILE * fp = fopen(filename, "rb");
while(fread(buffer, size, 1, fp)){
// do something
}
Let's assume size is 4 and the file size is 17 bytes. I thought fread could handle the last operation as well, even if fewer bytes are left in the file than the buffer size, but apparently it just terminates the while loop without reading the one last byte.
I tried to use the lower-level system call read() but I couldn't read any bytes for some reason.
What should I do if fread cannot handle last part of bytes that is smaller than buffer size?
Yep, turn your parameters around.
Instead of requesting one block of size bytes, you should request size blocks of 1 byte each. Then the function will return how many blocks (bytes) it was able to read:
size_t nread;
while( 0 < (nread = fread(buffer, 1, size, fp)) ) ...
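Spelled out a little further, the loop might look like the sketch below. The function name read_in_chunks and the last_len out-parameter are hypothetical, added only so the short final chunk can be observed.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads `fp` to the end in chunks of at most `chunk` bytes.
 * Returns the total number of bytes read and stores the length of the
 * final (possibly short) chunk in *last_len. */
static size_t read_in_chunks(FILE *fp, size_t chunk, size_t *last_len)
{
    size_t total = 0;
    size_t nread;
    char *buffer = malloc(chunk);

    *last_len = 0;
    if (buffer == NULL)
        return 0;

    /* size = 1, nmemb = chunk: the return value is a byte count */
    while ((nread = fread(buffer, 1, chunk, fp)) > 0) {
        /* process buffer[0 .. nread-1] here */
        total += nread;
        *last_len = nread;
    }

    free(buffer);
    return total;
}
```

For the 17-byte file from the question and a chunk size of 4, this should see four full chunks followed by a final chunk of 1 byte, for a total of 17.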
Try "man fread".
It clearly mentions the following, which answers your question:
SYNOPSIS
size_t fread(void *ptr, size_t size, size_t nitems, FILE *stream);
DESCRIPTION
fread() copies, into an array pointed to by ptr, up to nitems items of
data from the named input stream, where an item of data is a sequence
of bytes (not necessarily terminated by a null byte) of length size.
fread() stops appending bytes if an end-of-file or error condition is
encountered while reading stream, or if nitems items have been read.
fread() leaves the file pointer in stream, if defined, pointing to the
byte following the last byte read if there is one.
The argument size is typically sizeof(*ptr) where the pseudo-function
sizeof specifies the length of an item pointed to by ptr.
RETURN VALUE
fread() returns the number of items read. If size or nitems is 0, no
characters are read or written and 0 is returned.
The value returned will be less than nitems only if a read error or
end-of-file is encountered. The ferror() or feof() functions must be
used to distinguish between an error condition and an end-of-file
condition.
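The last paragraph of the man page excerpt can be turned into a small helper. This is an illustrative sketch only; the enum read_status and the function read_exact are invented names, not part of any library.

```c
#include <stdio.h>

/* Hypothetical status codes for the sketch below. */
enum read_status { READ_OK, READ_EOF, READ_ERROR };

/* Tries to read exactly `len` bytes into `buf`. Stores the number of bytes
 * actually read in *got and reports why a short read happened, if it did. */
static enum read_status read_exact(FILE *fp, void *buf, size_t len, size_t *got)
{
    *got = fread(buf, 1, len, fp);
    if (*got == len)
        return READ_OK;
    if (ferror(fp))
        return READ_ERROR;   /* short count caused by a read error */
    return READ_EOF;         /* short count caused by end-of-file */
}
```

The short count alone does not say what went wrong, which is why ferror() is consulted before concluding that the stream simply ended.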

Go to a certain point of a binary file in C (using fseek) and then reading from that location (using fread)

I am wondering if this is the best way to go about solving my problem.
I know the values for particular offsets of a binary file where the information I want is held...What I want to do is jump to the offsets and then read a certain amount of bytes, starting from that location.
After using google, I have come to the conclusion that my best bet is to use fseek() to move to the position of the offset, and then to use fread() to read an amount of bytes from that position.
Am I correct in thinking this? And if so, how is best to go about doing so? i.e. how to incorporate the two together.
If I am not correct, what would you suggest I do instead?
Many thanks in advance for your help.
Matt
Edit:
I followed a tutorial on fread() and adjusted it to the following:
#include <stdio.h>
int main()
{
FILE *f;
char buffer[11];
if (f = fopen("comm_array2.img", "rt"))
{
fread(buffer, 1, 10, f);
buffer[10] = 0;
fclose(f);
printf("first 10 characters of the file:\n%s\n", buffer);
}
return 0;
}
So I used the file 'comm_array2.img' and read the first 10 characters from the file.
But from what I understand of it, this goes from start-of-file, I want to go from some-place-in-file (offset)
Is this making more sense?
Edit Number 2:
It appears that I was being a bit dim, and all that is needed (it would seem from my attempt) is to put the fseek() before the fread() that I have in the code above, and it seeks to that location and then reads from there.
If you are using file streams instead of file descriptors, then you can write yourself a (simple) function analogous to the POSIX pread() system call.
You can easily emulate it using streams instead of file descriptors [1]. Perhaps you should write yourself a function such as this (which has a slightly different interface from the one I suggested in a comment):
size_t fpread(void *buffer, size_t size, size_t nitems, size_t offset, FILE *fp)
{
if (fseek(fp, offset, SEEK_SET) != 0)
return 0;
return fread(buffer, size, nitems, fp);
}
This is a reasonable compromise between the conventions of pread() and fread().
What would the syntax of the function call look like? For example, reading from the offset 732 and then again from offset 432 (both being from start of the file) and filestream called f.
Since you didn't say how many bytes to read, I'm going to assume 100 each time. I'm assuming that the target variables (buffers) are buffer1 and buffer2, and that they are both big enough.
if (fpread(buffer1, 100, 1, 732, f) != 1)
...error reading at offset 732...
if (fpread(buffer2, 100, 1, 432, f) != 1)
...error reading at offset 432...
The return count is the number of complete units of 100 bytes each; either 1 (got everything) or 0 (something went awry).
There are other ways of writing that code:
if (fpread(buffer1, sizeof(char), 100, 732, f) != 100)
...error reading at offset 732...
if (fpread(buffer2, sizeof(char), 100, 432, f) != 100)
...error reading at offset 432...
This reads 100 single bytes each time; the test ensures you got all 100 of them, as expected. If you capture the return value in this second example, you can know how much data you did get. It would be very surprising if the first read succeeded and the second failed; some other program (or thread) would have had to truncate the file between the two calls to fpread(), but funnier things have been known to happen.
[1] The emulation won't be perfect; the pread() call provides guaranteed atomicity that the combination of fseek() and fread() will not provide. But that will seldom be a problem in practice, unless you have multiple processes or threads concurrently updating the file while you are trying to position and read from it.
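For reference, here is a self-contained variant of the fpread() idea that can be compiled as is. It is a sketch under one stated assumption: the offset is typed as long here, to match the parameter type of fseek(), rather than size_t as in the answer.

```c
#include <stdio.h>

/* Seek to `offset` from the start of the file, then read `nitems` items of
 * `size` bytes. Returns the number of complete items read; 0 if the seek
 * fails. Assumption: `offset` is a long so it can be passed to fseek(). */
static size_t fpread(void *buffer, size_t size, size_t nitems,
                     long offset, FILE *fp)
{
    if (fseek(fp, offset, SEEK_SET) != 0)
        return 0;
    return fread(buffer, size, nitems, fp);
}
```

Called with size 1, the return value tells you how many bytes arrived; called with a larger size and nitems of 1, it only tells you whether the whole record arrived.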
It frequently depends on the distance between the parts you care about. If you're only skipping over/ignoring a few bytes between the parts you care about, it's often easier to just read that data and ignore what you read, rather than using fseek to skip past it. A typical way to do this is define a struct holding both the data you care about, and place-holders for the ones you don't care about, read in the struct, and then just use the parts you care about:
struct whatever {
long a;
long ignore;
short b;
} w;
fread(&w, 1, sizeof(w), some_file);
// use 'w.a' and 'w.b' here.
If there's any great distance between the parts you care about, though, chances are that your original idea of using fseek to get to the parts that matter will be simpler.
Your theory sounds correct. Open, seek, read, close.
Create a struct for the data you want to read and pass read() a pointer to the struct's allocated memory. You'll likely need #pragma pack(1) or similar on the struct to prevent misalignment problems.

Reading a binary file 1 byte at a time

I am trying to read a binary file in C 1 byte at a time and after searching the internet for hours I still can not get it to retrieve anything but garbage and/or a seg fault. Basically the binary file is in the format of a list that is 256 items long and each item is 1 byte (an unsigned int between 0 and 255). I am trying to use fseek and fread to jump to the "index" within the binary file and retrieve that value. The code that I have currently:
unsigned int buffer;
int index = 3; // any index value
size_t indexOffset = 256 * index;
fseek(file, indexOffset, SEEK_SET);
fread(&buffer, 256, 1, file);
printf("%d\n", buffer);
Right now this code is giving me random garbage numbers and seg faulting. Any tips as to how I can get this to work right?
You're confusing bytes with int. The usual C type for a byte is unsigned char. Most bytes are 8 bits wide. If the data you are reading is 8 bits wide, you will need to read it into an 8-bit type:
#define BUFFER_SIZE 256
unsigned char buffer[BUFFER_SIZE];
/* Read in 256 8-bit numbers into the buffer */
size_t bytes_read = 0;
bytes_read = fread(buffer, sizeof(unsigned char), BUFFER_SIZE, file_ptr);
// Note: sizeof(unsigned char) is for emphasis
The reason for reading all the data into memory is to keep the I/O flowing. There is an overhead associated with each input request, regardless of the quantity requested. Reading one byte at a time, or seeking to one position at a time is the worst case.
Here is an example of the overhead required for reading 1 byte:
Tell OS to read from the file.
OS searches to find the file location.
OS tells disk drive to power up.
OS waits for disk drive to get up to speed.
OS tells disk drive to position to the correct track and sector.
-->OS tells disk to read one byte and put into drive buffer.
OS fetches data from drive buffer.
Disk spins down to a stop.
OS returns 1 byte to your program.
In your program design, the above steps will be repeated 256 times. With everybody's suggestion, the line marked with "-->" will read 256 bytes. Thus the overhead is executed only once instead of 256 times to get the same quantity of data.
In your code you are trying to read 256 bytes to the address of one int. If you want to read one byte at a time, call fread(&buffer, 1, 1, file); (See fread).
But a simpler solution will be to declare an array of bytes, read it all together and process it after that.
unsigned char buffer; // note: 1 byte
fread(&buffer, 1, 1, file);
I believe it is time to read the man pages.
Couple of problems with the code as it stands.
The prototype for fread is:
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
You've set the size to 256 (bytes) and the count to 1. That's fine, that means "read one lump of 256 bytes, shove it into the buffer".
However, your buffer is on the order of 2-8 bytes long (or, at least, vastly smaller than 256 bytes), so you have a buffer overrun. You probably want to use fread(&buffer, 1, 1, file).
Furthermore, you're writing byte data through an int pointer. This will work on one endianness (little-endian, in fact), so you'll be fine on Intel architecture and from that learn bad habits that WILL come back and bite you, one of these days.
Try real hard to only write byte data into byte-organised storage, rather than into ints or floats.
You are trying to read 256 bytes into a 4-byte integer variable called "buffer". You are overwriting the next 252 bytes of other data.
It seems like buffer should either be unsigned char buffer[256]; or you should be doing fread(&buffer, 1, 1, f) and in that case buffer should be unsigned char buffer;.
Alternatively, if you just want a single character, you could just leave buffer as int (unsigned is not needed because C99 guarantees a reasonable minimum range for plain int) and simply say:
buffer = fgetc(f);
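Putting the suggestions from these answers together, a lookup of one entry in the 256-byte table could be sketched as follows. The function name read_byte_at is hypothetical, and the sketch assumes each entry is exactly one byte, so the file offset equals the index.

```c
#include <stdio.h>

/* Returns the byte stored at position `index` in the file as a value in
 * 0..255, or -1 if the seek or the read fails. */
static int read_byte_at(FILE *fp, long index)
{
    unsigned char value;   /* exactly one byte of storage */

    if (fseek(fp, index, SEEK_SET) != 0)
        return -1;
    if (fread(&value, 1, 1, fp) != 1)
        return -1;
    return value;
}
```

Returning int leaves room for the -1 failure value while still covering every possible byte value, which mirrors the way fgetc() reports EOF.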

Why does fwrite have both size and count parameters when just bytes to write would suffice? [duplicate]

We had a discussion here at work regarding why fread() and fwrite() take a size per member and count and return the number of members read/written rather than just taking a buffer and size. The only use for it we could come up with is if you want to read/write an array of structures which aren't evenly divisible by the platform alignment and hence have been padded but that can't be so common as to warrant this choice in design.
From fread(3):
The function fread() reads nmemb elements of data, each size bytes long,
from the stream pointed to by stream, storing them at the location given
by ptr.
The function fwrite() writes nmemb elements of data, each size bytes
long, to the stream pointed to by stream, obtaining them from the location
given by ptr.
fread() and fwrite() return the number of items successfully read or written
(i.e., not the number of characters). If an error occurs, or the
end-of-file is reached, the return value is a short item count (or zero).
The difference between fread(buf, 1000, 1, stream) and fread(buf, 1, 1000, stream) is that in the first case you get either one chunk of 1000 bytes or nothing (if the file is smaller), while in the second case you get everything in the file up to 1000 bytes.
It's based on how fread is implemented.
The Single UNIX Specification says
For each object, size calls shall be
made to the fgetc() function and the
results stored, in the order read, in
an array of unsigned char exactly
overlaying the object.
fgetc also has this note:
Since fgetc() operates on bytes,
reading a character consisting of
multiple bytes (or "a multi-byte
character") may require multiple calls
to fgetc().
Of course, this predates fancy variable-byte character encodings like UTF-8.
The SUS notes that this is actually taken from the ISO C documents.
This is pure speculation; however, back in the day (and some are still around), many filesystems were not simple byte streams on a hard drive.
Many file systems were record based, thus to satisfy such filesystems in an efficient manner, you'll have to specify the number of items ("records"), allowing fwrite/fread to operate on the storage as records, not just byte streams.
Here, let me fix those functions:
size_t fread_buf( void* ptr, size_t size, FILE* stream)
{
return fread( ptr, 1, size, stream);
}
size_t fwrite_buf( void const* ptr, size_t size, FILE* stream)
{
return fwrite( ptr, 1, size, stream);
}
As for a rationale for the parameters to fread()/fwrite(), I've lost my copy of K&R long ago so I can only guess. I think that a likely answer is that Kernighan and Ritchie may have simply thought that performing binary I/O would be most naturally done on arrays of objects. Also, they may have thought that block I/O would be faster/easier to implement or whatever on some architectures.
Even though the C standard specifies that fread() and fwrite() be implemented in terms of fgetc() and fputc(), remember that the standard came into existence long after C was defined by K&R and that things specified in the standard might not have been in the original designers' ideas. It's even possible that things said in K&R's "The C Programming Language" might not be the same as when the language was first being designed.
Finally, here's what P.J. Plauger has to say about fread() in "The Standard C Library":
If the size (second) argument is greater than one, you cannot determine
whether the function also read up to size - 1 additional characters beyond what it reports.
As a rule, you are better off calling the function as fread(buf, 1, size * n, stream); instead of
fread(buf, size, n, stream);
Basically, he's saying that fread()'s interface is broken. For fwrite() he notes that, "Write errors are generally rare, so this is not a major shortcoming" - a statement I wouldn't agree with.
Likely it goes back to the way that file I/O was implemented. Back in the day, it might have been faster to write to and read from files in blocks than to write everything at once.
Having separate arguments for size and count could be advantageous on an implementation that can avoid reading any partial records. If one were to use single-byte reads from something like a pipe, even if one was using fixed-format data, one would have to allow for the possibility of a record getting split over two reads. If one could instead request, e.g., a non-blocking read of up to 40 records of 10 bytes each when there are 293 bytes available, and have the system return 290 bytes (29 whole records) while leaving 3 bytes ready for the next read, that would be much more convenient.
I don't know to what extent implementations of fread can handle such semantics, but they could certainly be handy on implementations that could promise to support them.
I think it is because C lacks function overloading. If there was some, size would be redundant. But in C you can't determine a size of an array element, you have to specify one.
Consider this:
int intArray[10];
fwrite(intArray, sizeof(int), 10, fd);
If fwrite accepted number of bytes, you could write the following:
int intArray[10];
fwrite(intArray, sizeof(int)*10, fd);
But it is just inefficient. You will have sizeof(int) times more system calls.
Another point that should be taken into consideration is that you usually don't want only part of an array element to be written to a file. You want the whole integer or nothing. fwrite returns the number of elements successfully written. So if you discover that only the 2 low bytes of an element were written, what would you do?
On some systems (due to alignment) you can't access one byte of an integer without creating a copy and shifting.
