Is there a limitation on blocksize for reading when using fread in C? - c

I am currently programming an application for smartphones using C++ and the NDK. For reading external files, I use fread. This works well on Windows; on Android phones, however, my implementation of the deflate decompressor produced a mess. Of course I thought there was something wrong with my implementation of deflate, but that didn't really make sense, as everything worked perfectly on Windows machines. After hours, I was finally able to track the problem down to fread.
I am reading a file of size 4790954 bytes, and the return value of fread is also 4790954. However, I discovered that the buffer starts to contain trash at offset 4194304, which is exactly 4 MB. Is there any known limitation, defined in ANSI C, on the block size that can be read at once that I am not aware of? Also, isn't it a bug if the Google NDK's fread returns a count of 4790954 bytes read when it actually only read 4194304 bytes (4 MB)?

Is there a limitation on blocksize for reading when using fread in C?
Not per se. The limitation is implied by the data types. Android is 32-bit, so size_t is 32 bits. There is also the potential for object_size * number_of_objects to wrap, leading to a smaller read size than requested (these are unsigned values, so they wrap rather than overflow).
From The Open Group Base Specifications Issue 6 and fread:
size_t fread(void *restrict ptr, size_t size, size_t nitems,
FILE *restrict stream);
And the description:
The fread() function shall read into the array pointed to by ptr up to
nitems elements whose size is specified by size in bytes, from the
stream pointed to by stream. For each object, size calls shall be made
to the fgetc() function and the results stored, in the order read, in
an array of unsigned char exactly overlaying the object.
What does ferror(fp) return after the read? Is there an error?
Related: you might want to have a quick look at the answer to Using fread properly. I'm not claiming there's a problem in your usage of fread or fgetc, but there's no code, so we can't tell.
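Since no code was posted, here is a minimal sketch (not the asker's code) of how such a large read could be checked; the file name is a placeholder, and the byte count is taken from the question:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t file_size = 4790954;      /* size taken from the question */
    unsigned char *buf = malloc(file_size);
    if (buf == NULL)
        return 1;

    FILE *fp = fopen("data.bin", "rb");    /* "data.bin" is a placeholder name */
    if (fp == NULL) {
        free(buf);
        return 1;
    }

    /* Passing size = 1 and the byte count as nitems sidesteps any
       size * nitems wrap on a 32-bit size_t, and makes the return
       value a plain byte count. */
    size_t got = fread(buf, 1, file_size, fp);
    if (got != file_size) {
        if (ferror(fp))
            fprintf(stderr, "read error after %zu bytes\n", got);
        else if (feof(fp))
            fprintf(stderr, "short file: only %zu bytes\n", got);
    }

    fclose(fp);
    free(buf);
    return 0;
}

If got comes back as the full 4790954 but the tail of the buffer is still garbage, the problem is more likely in how the file was opened (see the fix below) than in fread's return value itself.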

I know it's quite some time ago, but looking around my profile I realized that this question remained open even though I found a solution to the problem a long time ago. The error was rather stupid: I didn't open the file in binary mode using fopen, which led to this strange behaviour...
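For anyone hitting the same symptom, the fix boils down to the mode string passed to fopen. A minimal sketch (the file name is a placeholder, not from the original post):

#include <stdio.h>

int main(void)
{
    /* "r" opens the stream in text mode; on some platforms that translates
       line endings and can treat a stray byte as end-of-file, corrupting
       binary data. "rb" requests an untranslated byte stream. */
    FILE *fp = fopen("archive.bin", "rb");   /* placeholder file name */
    if (fp == NULL)
        return 1;

    unsigned char chunk[4096];
    size_t n;
    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        /* feed n bytes to the decompressor here */
    }

    fclose(fp);
    return 0;
}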

Related

Why can't linux write more than 2147479552 bytes?

In man 2 write the NOTES section contains the following note:
On Linux, write() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)
Why is that?
The DESCRIPTION section contains the following sentence:
According to POSIX.1, if count is greater than SSIZE_MAX, the result is implementation-defined
SSIZE_MAX is way bigger than 0x7ffff000. Why is this note there?
Update: Thanks for the answer! In case anyone is interested (and for better SEO to help developers out here), all functions with that limitation are:
read
write
sendfile
To find this out one just has to full text search the manual:
% man -wK "0x7ffff000"
/usr/share/man/man2/write.2.gz
/usr/share/man/man2/read.2.gz
/usr/share/man/man2/sendfile.2.gz
/usr/share/man/man2/sendfile.2.gz
Why is this here?
I don't think there's necessarily a good reason for this - I think this is basically a historical artifact. Let me explain with some git archeology.
In current Linux, this limit is governed by MAX_RW_COUNT:
ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
    [...]
    if (count > MAX_RW_COUNT)
        count = MAX_RW_COUNT;
That constant is defined as the bitwise AND of the maximum int value and the page mask, which rounds the maximum integer value down to a page boundary (roughly INT_MAX minus one page).
#define MAX_RW_COUNT (INT_MAX & PAGE_MASK)
So that's where 0x7ffff000 comes from: your platform has pages which are 4096 bytes wide, which is 2^12, so it's the maximum integer value with the bottom 12 bits cleared.
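If you want to sanity-check that arithmetic, here is a tiny sketch (it assumes 4096-byte pages; in the kernel, PAGE_MASK is ~(PAGE_SIZE - 1)):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    unsigned long page_size = 4096;               /* assumed page size */
    unsigned long page_mask = ~(page_size - 1);   /* kernel-style PAGE_MASK */
    printf("0x%lx\n", (unsigned long)INT_MAX & page_mask);  /* prints 0x7ffff000 */
    return 0;
}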
The last commit to change this, ignoring commits which just move things around, was e28cc71572da3.
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date: Wed Jan 4 16:20:40 2006 -0800
Relax the rw_verify_area() error checking.
In particular, allow over-large read- or write-requests to be downgraded
to a more reasonable range, rather than considering them outright errors.
We want to protect lower layers from (the sadly all too common) overflow
conditions, but prefer to do so by chopping the requests up, rather than
just refusing them outright.
So, this gives us a reason for the change: to prevent integer overflow, the size of the write is capped at a size near the maximum integer. Most of the surrounding logic seems to have been changed to use longs or size_t's, but the check remains.
Before this change, giving it a buffer larger than INT_MAX would result in an EINVAL error:
    if (unlikely(count > INT_MAX))
        goto Einval;
As for why this limit was put in place, it existed prior to 2.6.12, the first version that was put into git. I'll let someone with more patience than me figure that one out. :)
Is this POSIX compliant?
Putting on my standards-lawyer hat, I think this is actually POSIX compliant. Yes, POSIX does say that writes larger than SSIZE_MAX are implementation-defined behavior, but this cap is not larger than that limit, so that clause alone doesn't cover it. However, there are two other sentences in the standard which I think are important:
The write() function shall attempt to write nbyte bytes from the buffer pointed to by buf to the file associated with the open file descriptor, fildes.
[...]
Upon successful completion, write() and pwrite() shall return the number of bytes actually written to the file associated with fildes. This number shall never be greater than nbyte. Otherwise, -1 shall be returned and errno set to indicate the error.
The partial write is explicitly allowed by the standard. For this reason, all code which calls write() needs to wrap calls to write() in a loop which retries short writes.
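As an illustration of that retry loop, here is a common idiom (a sketch, not taken from any particular project):

#include <errno.h>
#include <unistd.h>

/* Write the whole buffer, retrying short writes and EINTR.
   Returns 0 on success, -1 on error (errno is set by write()). */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            return -1;
        }
        p += n;                     /* advance past the bytes actually written */
        len -= (size_t)n;
    }
    return 0;
}

With this in place, the 0x7ffff000 cap just shows up as an ordinary short write and the loop quietly finishes the job.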
Should the limit be raised?
Ignoring the historical baggage, and the standard, is there a reason to raise this limit today?
I'd argue the answer is no. The optimal size of the write() buffer is a tradeoff between trying to avoid excessive context switches between kernel and userspace, and ensuring your data fits into cache as much as possible.
The coreutils programs (which provide cat, cp, etc) use a buffer size of 128KiB. The optimal size for your hardware might be slightly larger or smaller. But it's unlikely that 2GB buffers are going to be faster.

C How fread reads different data blocks in a binary file?

I'm porting some C code to C#. I know little about C, but I'm flexible and can learn new programming languages. Anyway, I wasn't able to figure out the exact behaviour of the code I'm porting.
I've read about fread() here and on the web.
fread(&(targetObj->data), sizeof(TestObj), 1, file);
Now, file is a big binary file with lots of data in it.
What I want to know is how I can do this in C#.
Let me explain:
I think that line of code does this:
TestObj is an unsigned short
reads, 1 time, a chunk of data the size of TestObj (an unsigned short)
reads it from file (which is a pointer to a binary file on the filesystem) into targetObj->data
What I don't understand is:
I have a big binary file; what does it actually read? Are there headers somewhere that define where an unsigned-short-sized chunk of data is written?
Where in the binary does it take that object from? How can I know how to read it back from the binary file in C#? Maybe C knows where to pick that single unsigned short, but I don't in C#.
For example, if that binary file has 40 unsigned shorts saved in it, does the C code line above read just the first one?
and if I do
fread(&(targetObj->data), sizeof(TestObj), 5, file);
is testObj->data expected to be an array of 5 unsigned shorts?
And will the code read the first 5 unsigned shorts it finds in the whole binary file?
I can't wrap my head around this, but I need to know how C recognizes that unsigned short in a big binary file whose content I don't know, and how I can tell C# to read the first such unsigned short from that file.
fread just reads the specified number of bytes from the current file cursor position, and advances the file cursor (or "file pointer", but not to be confused with a C pointer).
So if sizeof(TestObj) is 2, it will read two bytes and place them into the location pointed to by &(targetObj->data), with no bounds checking and regardless of any difference between your architecture's endianness and the file format's endianness. Note that this approach is not a platform-independent way of parsing files containing numbers in binary form, since a number might be stored differently on your machine than it is stored inside the file (by whoever designed the binary format you are trying to read).
In C#, you might achieve a similar thing by manually specifying struct packing and field placement, although the code will suffer from the same problems as your C code.
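To make that concrete in C (the same idea carries over to C#), here is a minimal sketch of reading one 16-bit value in an endianness-safe way; the little-endian assumption is mine, not something stated in the question:

#include <stdio.h>
#include <stdint.h>

/* Read one little-endian 16-bit value from the current file position,
   byte by byte, so the result does not depend on the host's endianness.
   (Assumes the file format is little-endian; swap the shifts if not.) */
static int read_u16_le(FILE *fp, uint16_t *out)
{
    unsigned char b[2];
    if (fread(b, 1, 2, fp) != 2)
        return -1;                        /* short read or error */
    *out = (uint16_t)(b[0] | (b[1] << 8));
    return 0;
}

Calling it five times in a row reads the next five values, which is roughly what fread(&(targetObj->data), sizeof(TestObj), 5, file) does in one call, minus the endianness handling. In C#, BinaryReader.ReadUInt16 performs the equivalent little-endian read.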
fread reads from the current position in the stream; see also ftell and fseek. The equivalent in C# would be Stream.Read.
From man fread
size_t
fread(void *restrict ptr, size_t size, size_t nitems, FILE *restrict stream);
The function fread() reads nitems objects, each size bytes long, from the stream pointed to by stream, storing them at the location given by ptr.
sizeof(short) is resolved by the compiler, as per https://stackoverflow.com/a/14171152/6204612
And C does not do any pretty conversions for you. What is read is precisely sizeof(short) bytes, and these bytes are put into the TestObj variable. Whether that is correct or not is the implementer's responsibility. You need to manage offsets, collection sizes, etc. on your own, as sketched below.
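For instance, if the file were just a flat array of 2-byte records with no header (an assumption for illustration, not something stated in the question), jumping to the Nth record is entirely on you:

#include <stdio.h>

/* Seek to record number `index`, assuming a flat file of 2-byte records
   with no header. Returns 0 on success, like fseek itself. */
static int seek_record(FILE *fp, long index)
{
    return fseek(fp, index * 2L, SEEK_SET);
}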

Reading a binary file bit by bit

I know the function below:
size_t fread(void *ptr, size_t size_of_elements, size_t number_of_elements, FILE *a_file);
It only reads byte by byte; my goal is to be able to read 12 bits at a time and put them into an array. Any help or pointers would be greatly appreciated!
Adding to the first comment, you can try reading one byte at a time (declare a char variable and read into it), and then use the bitwise operators >> and << to pick out the bits you need. Read more here: http://www.cprogramming.com/tutorial/bitwise_operators.html
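As a small illustration of that shifting approach, here is a sketch that unpacks two 12-bit values from three bytes; it assumes the values are packed most-significant-bit first, which may not match your file format:

#include <stdint.h>

/* Unpack two 12-bit values from three consecutive bytes, assuming the
   first value occupies byte 0 plus the high nibble of byte 1. */
static void unpack12(const unsigned char b[3], uint16_t out[2])
{
    out[0] = (uint16_t)((b[0] << 4) | (b[1] >> 4));
    out[1] = (uint16_t)(((b[1] & 0x0F) << 8) | b[2]);
}

Read the file three bytes at a time with fread and run each triple through a helper like this to fill your array of 12-bit values.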
Many years ago, I wrote some I/O routines in C for a Huffman encoder. This needs to be able to read and write on the granularity of bits rather than bytes. I created functions analogous to read(2) and write(2) that could be asked to (say) read 13 bits from a stream. To encode, for example, bytes would be fed into the coder and variable numbers of bits would emerge the other side. I had a simple structure with a bit pointer into the current byte being read or written. Every time it went off the end, it flushed the completed byte out and reset the pointer to zero. Unfortunately that code is long gone, but it might be an idea to pull apart an open-source Huffman coder and see how the problem was solved there. Similarly, base64 coding takes 3 bytes of data and turns them into 4 (or vice versa).
I've implemented a couple of methods to read/write files bit by bit. Here they are. Whether it is viable or not for your use case, you have to decide for yourself. I've tried to make the most readable, optimized code I could, not being a seasoned C developer (for now).
Internally, it uses a "bitCursor" to store information about previous bits that don't yet fit a full byte. It has two data fields: d stores the data and s stores the size, i.e. the number of bits stored in the cursor.
You have four functions:
newBitCursor(), which returns a bitCursor object with default values {0,0}. Such a cursor is needed at the beginning of a sequence of read/write operations to or from a file.
fwriteb(void *ptr, size_t sizeb, size_t rSizeb, FILE *fp, bitCursor *c), which writes the sizeb rightmost bits of the value stored in ptr to fp.
fcloseb(FILE *fp, bitCursor *c), which writes a final partial byte if the previous writes did not end exactly on a byte boundary, which is probably almost always the case...
freadb(void *ptr, size_t sizeb, size_t rSizeb, FILE *fp, bitCursor *c), which reads sizeb bits and bitwise ORs them into *ptr. (It is therefore your responsibility to initialize *ptr to 0.)
More info is provided in the comments. Have Fun!
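The implementation itself isn't reproduced in this thread, so here is a minimal sketch of what the reading side of such a cursor could look like; the names and exact behaviour are my assumptions, not the original code:

#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint64_t d;     /* leftover bits not yet handed to the caller */
    unsigned s;     /* number of leftover bits currently held in d */
} bitCursor;

/* Read sizeb bits (up to 57 in this sketch) from fp into *out.
   Bits are taken most-significant-first within each byte. */
static int freadb_sketch(uint64_t *out, unsigned sizeb, FILE *fp, bitCursor *c)
{
    while (c->s < sizeb) {                  /* top up the cursor byte by byte */
        int byte = fgetc(fp);
        if (byte == EOF)
            return -1;
        c->d = (c->d << 8) | (uint64_t)byte;
        c->s += 8;
    }
    c->s -= sizeb;
    *out = (c->d >> c->s) & ((1ULL << sizeb) - 1ULL);  /* take the top sizeb bits */
    return 0;
}

With a bitCursor initialized to {0, 0}, calling freadb_sketch(&v, 12, fp, &cur) repeatedly yields one 12-bit value per call.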
Edit: It has come to my attention today that when I made that, I assumed little-endian! :P Oops! It's always nice to realize how much of a noob I still am ;D.
Edit: GNU's Binary File Descriptors.
Read the first two bytes from your a_file file pointer and check the bits in the least or greatest byte — depending on the endianness of your platform (x86 is little-endian) — using bitshift operators.
You can't really put bits into an array, as there isn't a datatype for bits. Rather than keeping 1's and 0's in an array, which is inefficient, it seems cheaper just to keep the two bytes in a two-element array (say, of type unsigned char *) and write functions to map those two bytes to one of 4096 (2^12) values-of-interest.
As a complication, on subsequent reads, if you want to fread through the pointer every 12 bits, you would read only one byte, using the left-over bits from the previous read to build a new 12-bit value. If there are no leftovers, you would need to read two bytes.
Your mapping functions would need to address the second case where bits were used from previous read, because the two bytes would need different mapping. To do this efficiently, a modulus on a read-counter could be used to swap between two mappings.
Read 2 bytes and do bitwise operations on them; on the next read, start from the 2nd byte onwards and apply the bitwise operations again to get back what you expect.
For your problem, see the demo program below, which stores a 12-bit value in a bit-field but still reads and writes whole structs, since file I/O cannot be narrower than a byte; this is the kind of thing bit-wise access is used for.
fread() and fwrite() are standard library functions whose size argument is a byte count of type size_t, so reading exactly 12 bits is not possible. If you create the file yourself, create it as below and read it back as below, and that solves your problem.
If the file is a special file that was not written by you, then follow the published format for that file when reading it; I expect its writers are doing something like this as well. Or you can provide the exact format and I may be able to help.
#include <stdio.h>

/* A 12-bit bit-field (plain int bit-fields are signed on typical compilers);
   note that fwrite/fread still transfer sizeof(NODE) bytes (typically 4),
   not 12 bits. */
struct node
{
    int data:12;
} NODE;

int main()
{
    FILE *fp;

    fp = fopen("t", "wb");              /* binary mode */
    NODE.data = 1024;
    printf("%d\n", NODE.data);
    fwrite(&NODE, sizeof(NODE), 1, fp);

    NODE.data = 2047;                   /* a signed 12-bit field holds -2048..2047 */
    printf("%d\n", NODE.data);
    fwrite(&NODE, sizeof(NODE), 1, fp);
    fclose(fp);

    fp = fopen("t", "rb");
    fread(&NODE, sizeof(NODE), 1, fp);
    printf("%d\n", NODE.data);
    fread(&NODE, sizeof(NODE), 1, fp);
    printf("%d\n", NODE.data);
    fclose(fp);

    return 0;
}

Using fseek and ftell to determine the size of a file has a vulnerability?

I've read posts that show how to use fseek and ftell to determine the size of a file.
FILE *fp;
long file_size;
char *buffer;

fp = fopen("foo.bin", "r");
if (NULL == fp) {
    /* Handle Error */
}

if (fseek(fp, 0, SEEK_END) != 0) {
    /* Handle Error */
}

file_size = ftell(fp);

buffer = (char*)malloc(file_size);
if (NULL == buffer) {
    /* handle error */
}
I was about to use this technique but then I ran into this link that describes a potential vulnerability.
The link recommends using fstat instead. Can anyone comment on this?
The link is one of the many nonsensical pieces of C coding advice from CERT. Their justification is based on liberties the C standard allows an implementation to take, but which are not allowed by POSIX and thus irrelevant in all cases where you have fstat as an alternative.
POSIX requires:
that the "b" modifier for fopen have no effect, i.e. that text and binary mode behave identically. This means their concern about invoking UB on text files is nonsense.
that files have a byte-resolution size set by write operations and truncate operations. This means their concern about random numbers of null bytes at the end of the file is nonsense.
Sadly with all the nonsense like this they publish, it's hard to know which CERT publications to take seriously. Which is a shame, because lots of them are serious.
If your goal is to find the size of a file, definitely you should use fstat() or its friends. It's a much more direct and expressive method--you are literally asking the system to tell you the file's statistics, rather than the more roundabout fseek/ftell method.
A bonus tip: if you only want to know if the file is available, use access() rather than opening the file or even stat'ing it. This is an even simpler operation which many programmers aren't aware of.
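For completeness, here is a sketch of the fstat() route (POSIX-only, so it won't help on a bare ISO C implementation):

#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>

/* Return the size of an open stream's underlying file via fstat(),
   or -1 on error. fileno() and fstat() are POSIX, not ISO C. */
static off_t stream_size(FILE *fp)
{
    struct stat st;
    if (fstat(fileno(fp), &st) != 0)
        return -1;
    return st.st_size;
}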
The reason to not use fstat is that fstat is POSIX, but fopen, ftell and fseek are part of the C Standard.
There may be a system that implements the C Standard but not POSIX. On such a system fstat would not work at all.
I'd tend to agree with their basic conclusion that you generally shouldn't use the fseek/ftell code directly in the mainstream of your code -- but you probably shouldn't use fstat either. If you want the size of a file, most of your code should use something with a clear, direct name like filesize.
Now, it probably is better to implement that using fstat where available, and (for example) FindFirstFile on Windows (the most obvious platform where fstat usually won't be available).
The other side of the story is that many (most?) of the limitations on fseek with respect to binary files actually originated with CP/M, which didn't explicitly store the size of a file anywhere. The end of a text file was signaled by a control-Z. For a binary file, however, all you really knew was what sectors were used to store the file. In the last sector, you had some amount of unused data that was often (but not always) zero-filled. Unfortunately, there might be zeros that were significant, and/or non-zero values that weren't significant.
If the entire C standard had been written just before being approved (e.g., if it had been started in 1988 and finished in 1989) they'd probably have ignored CP/M completely. For better or worse, however, they started work on the C standard in something like 1982 or so, when CP/M was still in wide enough use that it couldn't be ignored. By the time CP/M was gone, many of the decisions had already been made and I doubt anybody wanted to revisit them.
For most people today, however, there's just no point -- most code won't port to CP/M without massive work; this is one of the relatively minor problems to deal with. Making a modern program run in only 48K (or so) of memory for both the code and data is a much more serious problem (having a maximum of a megabyte or so for mass storage would be another serious problem).
CERT does have one good point though: you probably should not (as is often done) find the size of a file, allocate that much space, and then assume the contents of the file will fit there. Even though the fseek/ftell will give you the correct size with modern systems, that data could be stale by the time you actually read the data, so you could overrun your buffer anyway.
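One way to respect that caveat is to treat any pre-computed size as nothing more than a hint and let the read itself decide when to stop. A rough sketch (not from the original answer):

#include <stdio.h>
#include <stdlib.h>

/* Read an entire stream into memory without trusting a pre-computed size:
   grow the buffer as data actually arrives. Returns NULL on error; the
   caller can check ferror(fp) to distinguish EOF from a read error. */
static char *slurp(FILE *fp, size_t *out_len)
{
    size_t cap = 1 << 16, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        if (len == cap) {
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
        size_t n = fread(buf + len, 1, cap - len, fp);
        len += n;
        if (n == 0)
            break;              /* EOF or error */
    }
    *out_len = len;
    return buf;
}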
According to the C standard, §7.21.3:
Setting the file position indicator to end-of-file, as with fseek(file,
0, SEEK_END), has undefined behavior for a binary stream (because of
possible trailing null characters) or for any stream with
state-dependent encoding that does not assuredly end in the initial
shift state.
A letter-of-the-law kind of guy might think this UB can be avoided by calculating file size with:
fseek(file, -1, SEEK_END);
size = ftell(file) + 1;
But the C standard also says this:
A binary stream need not meaningfully support fseek calls with a
whence value of SEEK_END.
As a result, there is nothing we can do to fix this with regard to fseek / SEEK_END. Still, I would prefer fseek / ftell instead of OS-specific API calls.

Does fread fail for large files?

I have to analyze a 16 GB file. I am reading through the file sequentially using fread() and fseek(). Is it feasible? Will fread() work for such a large file?
You don't mention a language, so I'm going to assume C.
I don't see any problems with fread, but fseek and ftell may have issues.
Those functions use long int as the data type to hold the file position, rather than something intelligent like fpos_t or even size_t. This means that they can fail to work on a file over 2 GB, and can certainly fail on a 16 GB file.
You need to see how big long int is on your platform. If it's 64 bits, you're fine. If it's 32, you are likely to have problems when using ftell to measure distance from the start of the file.
Consider using fgetpos and fsetpos instead.
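A quick sketch of the fgetpos/fsetpos pairing; note that fpos_t is opaque, so this only lets you save and restore positions, not measure offsets:

#include <stdio.h>

/* Save the current position, read up to n bytes, then restore the
   position. Returns the number of bytes read, or -1 on error. */
static long peek_bytes(FILE *fp, unsigned char *buf, size_t n)
{
    fpos_t pos;
    if (fgetpos(fp, &pos) != 0)
        return -1;
    size_t got = fread(buf, 1, n, fp);
    if (fsetpos(fp, &pos) != 0)
        return -1;
    return (long)got;
}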
Thanks for the response. I figured out where I was going wrong. fseek() and ftell() do not work for files larger than 4GB. I used _fseeki64() and _ftelli64() and it is working fine now.
If implemented correctly this shouldn't be a problem. I assume by sequentially you mean you're looking at the file in discrete chunks and advancing your file pointer.
Check out http://www.computing.net/answers/programming/using-fread-with-a-large-file-/10254.html
It sounds like he was doing nearly the same thing as you.
It depends on what you want to do. If you want to read the whole 16GB of data in memory, then chances are that you'll run out of memory or application heap space.
Rather read the data chunk by chunk and do processing on those chunks (and free resources when done).
But, besides all this, decide which approach you want to do (using fread() or istream, etc.) and do some test cases to see which works better for you.
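A sketch of that chunk-by-chunk pattern (the buffer size here is an arbitrary choice, not a recommendation):

#include <stdio.h>

/* Process a large file in fixed-size chunks: no whole-file buffer and
   no ftell(), so the file size never has to fit in a long. */
static int process_file(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    static unsigned char chunk[1 << 20];    /* 1 MiB working buffer */
    size_t n;
    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        /* analyze n bytes of data here */
    }

    int err = ferror(fp) ? -1 : 0;
    fclose(fp);
    return err;
}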
If you're on a POSIX-ish system, you'll need to make sure you've built your program with 64-bit file offset support. POSIX mandates (or at least allows, and most systems enforce this) the implementation to deny IO operations on files whose size don't fit in off_t, even if the only IO being performed is sequential with no seeking.
On Linux, this means you need to use -D_FILE_OFFSET_BITS=64 on the gcc command line.
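If you also need to seek within such a file, the POSIX fseeko()/ftello() pair uses off_t instead of long, which together with the flag above gives 64-bit offsets even on 32-bit builds. A sketch (assumes a POSIX system):

#define _FILE_OFFSET_BITS 64    /* must appear before any system header */
#include <sys/types.h>
#include <stdio.h>

/* Jump to an absolute byte offset that may exceed 2 GB. */
static int seek_to(FILE *fp, off_t offset)
{
    return fseeko(fp, offset, SEEK_SET);    /* returns 0 on success */
}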
