Opening big files with fopen [duplicate]

Opening big files with fopen [duplicate] - c

On a 32-bit system, what does ftell return if the current position indicator of a file opened in binary mode is past the 2GB point? In the C99 standard, is this undefined behavior since ftell must return a long int (maximum value being 2**31-1)?

on long int
long int is supposed to be AT LEAST 32-bits, but C99 standard does NOT limit it to 32-bit.
C99 standard does provide convenience types like int16_t & int32_t etc that map to correct bit sizes for a target platform.
on ftell/fseek
ftell() and fseek() are limited to 32 bits (including sign bit) on the vast majority of 32-bit architecture systems. So when there is large file support you run into this 2GB issue.
POSIX.1-2001 and SysV functions for fseek and ftell are fseeko and ftello because they use off_t as the parameter for the offset.
you do need to define compile with -D_FILE_OFFSET_BITS=64 or define it somewhere before including stdio.h to ensure that off_t is 64-bits.
Read about this at the cert.org secure coding guide.
On confusion about ftell and size of long int
C99 says long int must be at least 32-bits it does NOT say that it cannot be bigger
try the following on x86_64 architecture:
#include <stdio.h>
int main(int argc, char *argv[]) {
FILE *fp;
fp = fopen( "test.out", "w");
if ( !fp )
return -1;
fseek(fp, (1L << 34), SEEK_SET);
fprintf(fp, "\nhello world\n");
fclose(fp);
return 0;
}
Notice that 1L is just a long, this will produce a file that's 17GB and sticks a "\nhello world\n" to the end of it. Which you can verify is there by trivially using tail -n1 test.out or explicitly using:
dd if=test.out skip=$((1 << 25))
Note that dd typically uses block size of (1 << 9) so 34 - 9 = 25 will dump out '\nhello world\n'

At least on a 32bit OS ftell() it will overflow or error or simply run into Undefined Behaviour.
To get around this you might like to use off_t ftello(FILE *stream); and #define _FILE_OFFSET_BITS 64.
Verbatim from man ftello:
The fseeko() and ftello() functions are identical to fseek(3) and ftell(3) (see fseek(3)), respectively, except that the offset argument of fseeko() and the return value of ftello() is of type off_t instead of long.
On many architectures both off_t and long are 32-bit types, but compilation with
#define _FILE_OFFSET_BITS 64
will turn off_t into a 64-bit type.
Update:
According to IEEE Std 1003.1, 2013 Edition ftell() shall return -1 and set errno to EOVERFLOW in such cases:
EOVERFLOW
For ftell(), the current file offset cannot be represented correctly in an object of type long.

There is no 64b aware method in C99 standard. What OS/environment are you using? On windows, there is _ftelli64.
On other platforms, look at http://forums.codeguru.com/showthread.php?277234-Cannot-use-fopen()-open-file-larger-than-4-GB

This worked for me on Windows32/MinGW to play with a 6GB file
#define _FILE_OFFSET_BITS 64
#include<stdio.h>
int main() {
FILE *f = fopen("largefile.zip","rb");
fseeko64(f, 0, SEEK_END);
off64_t size = ftello64(f);
printf("%llu\n", size);
}
gcc readlargefile.c -c -std=C99 -o readlargefile.exe
Every single detail, the macro, the compiler option, matters.

Related

what does `l` in `lseek` of unistd.h mean?

I am reading APUE to explore the details of C and Unix, and encounter lseek
NAME
lseek - move the read/write file offset
SYNOPSIS
#include <unistd.h>
off_t lseek(int fildes, off_t offset, int whence);
What does l mean, is it length?

l is for long integer.
It is named like that to differentiate from the old seek() in version 2 of AT&T Unix. This is an anachronism before the off_t type was introduced.
References:
Infohost indicates:
The character l in the name lseek means "long integer". Before the
introduction of the off_t data type, the offset argument and the
return value were long integers. lseek was introduced with Version 7
when long integers were added to C. (Similar functionality was
provided in Version 6 by the functions seek and tell.)
As noted at the foot of lseek.html:
A seek() function appeared in Version 2 AT&T UNIX, later renamed into
lseek() for ``long seek'' due to a larger offset argument type.
Note: Paraphrased from Why is the function called lseek(), not seek()?

fseek to a 32-bit unsigned offset

I am reading a file format (TIFF) that has 32-bit unsigned offsets from the beginning of the file.
Unfortunately the prototype for fseek, the usual way I would go to particular file offset, is:
int fseek ( FILE * stream, long int offset, int origin );
so the offset is signed. How should I handle this situation? Should I be using a different function for seeking?

After studying this question more deeply and considering the other comments and answers (thank you), I think the simplest approach is to do two seeks if the offset is greater than 2147483647 bytes. This allows me to keep the offsets as uint32_t and continue using fseek. The positioning code is therefore like this:
// note: error handling code omitted
uint32_t offset = ... (whatever it is)
if( offset > 2147483647 ){
fseek( file, 2147483647, SEEK_SET );
fseek( file, (long int)( offset - 2147483647 ), SEEK_CUR );
} else {
fseek( file, (long int) offset, SEEK_SET );
}
The problem with using 64-bit types is that the code might be running on a 32-bit architecture (among other things). There is a function fsetpos which uses a structure fpos_t to manage arbitrarily large offsets, but that brings with it a range of complexities. Although fsetpos might make sense if I was truly using offsets of arbitrarily large size, since I know the largest possible offset is uint32_t, then the double seek meets that need.
Note that this solution allows all TIFF files to be handled on a 32-bit system. The advantage of this is obvious if you consider commercial programs like PixInsight. PixInsight can only handle TIFF files smaller than 2147483648 bytes when running on 32-bit systems. To handle full sized TIFF files, a user has to use the 64-bit version of PixInsight on a 64-bit computer. This is probably because the PixInsight programmers used a 64-bit type to handle the offsets internally. Since my solution only uses 32-bit types, I can handle full-sized TIFF files on a 32-bit system (as long as the underlying operating system can handle files that large).

You can try to use lseek64() (man page)
#define _LARGEFILE64_SOURCE /* See feature_test_macros(7) */
#include <sys/types.h>
#include <unistd.h>
off64_t lseek64(int fd, off64_t offset, int whence);
With
int fd = fileno (stream);
Notes from The GNU C lib - Setting the File Position of a Descriptor
This function is similar to the lseek function. The difference is that the offset parameter is of type off64_t instead of off_t which makes it possible on 32 bit machines to address files larger than 2^31 bytes and up to 2^63 bytes. The file descriptor filedes must be opened using open64 since otherwise the large offsets possible with off64_t will lead to errors with a descriptor in small file mode.
When the source file is compiled with _FILE_OFFSET_BITS == 64 on a 32 bits machine this function is actually available under the name lseek and so transparently replaces the 32 bit interface.
About fd and stream, from Streams and File Descriptors
Since streams are implemented in terms of file descriptors, you can extract the file descriptor from a stream and perform low-level operations directly on the file descriptor. You can also initially open a connection as a file descriptor and then make a stream associated with that file descriptor.

C pread gives different results

I have two systems. One is Ubuntu 14.04 64bit on a Intel CPU, the other one is Ubuntu 14.04 for ARM on a CubieTruck.
The Intel system has a data file stored on a ext4 formatted HDD. The CubieTruck has the same file on a NTFS HDD, which is mounted with NTFS-3G.
I currently have a problem with pread() on those systems. I read a bunch of bytes from a file, and print out the first 64 byte from this chunk. Later, these bytes are used to calculate some hash using Shabal.
While the data printed on the CubieTruck matches exactly what I see on a Windows system when opening the file with a Hex-editor, the output on the 64bit Ubuntu is different. It looks like it is filled with "FFFFFF", but also different in general. Even more strange is, that while output on the CubieTruck always stays the same, it changes on the 64bit Ubuntu system after a while (I haven't seen a pattern when that happens, I just check from time to time).
But the most annoying thing is, that the x64 system seems to calculate correctly, while the ARM system is wrong.
I have no idea why pread delivers different results for the same file under those systems, but I hope someone can shed some light into it.
edit, the code:
int main(int argc, char **argv) {
unsigned int readsize = 16384 * 32 * 2;
char *cache = (char*) malloc(readsize);
int fh = open("/home/user/somefile", O_RDONLY);
if (fh < 0) {
printf("can't open file");
exit(-1);
}
int bytes = 0, b;
do {
b = pread(fh, &cache[bytes], readsize - bytes, bytes);
bytes += b;
} while(bytes < readsize && b > 0);
int i = 0;
for (i=0; i < 64; i++) {
printf("%02X", cache[i]);
}
close(fh);
free(cache);
return 0;
}
both systems are opening the exact same file.
result on x64:
FFFFFF94FFFFFFF16D25FFFFFFC0FFFFFFA3367D010BFFFFFFEF1E12FFFFFF841CFFFFFFBE4C26FFFFFF92FFFFFF80FFFFFF86FFFFFFA822FFFFFF8A26FFFFFF906CFFFFFFAD05FFFFFFE7FFFFFFB124FFFFFFA8FFFFFFF77B16FFFFFFEAFFFFFFACFFFFFF9DFFFFFF9EFFFFFF81FFFFFFC7FFFFFF92FFFFFFCDFFFFFFB0FFFFFFE86270FFFFFFF974FFFFFFA8420C45FFFFFFFC04FFFFFFF9103F2E3A47FFFFFF990F
result on ARM:
94F16D25C0A3367D010BEF1E12841CBE4C26928086A8228A26906CAD05E7B124A8F77B16EAAC9D9E81C792CDB0E86270F974A8420C45FC04F9103F2E3A47990F
You can see, on x64, the result is filled with "FFFFFF", and it appears, that this is somehow needed later on. But I don't get why it's different on my systems.

One of the fun aspects of C is the amount of implementation-defined behaviour - in this case, whether char is signed or not.
The %x format specifier takes an unsigned int argument, so in the ARM case is the conversion is straightforward - char is unsigned so just gets zero-extended to unsigned int. However for x86 where it's signed, the converion can go one of two ways:
sign-extend the char to a signed int, then cast it to unsigned int
first cast to unsigned char, then zero-extend to unisgned int
It appears the char->int part of the conversion takes precedence over the signed->unsigned part* so you get the former (note how the bytes without the top bit set are unambiguous and print the same on both implementations). I imagine your calculation does a similar conversion somewhere expecting signedness, hence why it breaks on ARM.
In short, if you're dealing with char-sized values rather than characters, always specify signed char or unsigned char as appropriate, never bare char.
* I suppose I could dig out the standard to check if that's actually specified, but at this point it's merely a trivial detail

ftell at a position past 2GB

On a 32-bit system, what does ftell return if the current position indicator of a file opened in binary mode is past the 2GB point? In the C99 standard, is this undefined behavior since ftell must return a long int (maximum value being 2**31-1)?

on long int
long int is supposed to be AT LEAST 32-bits, but C99 standard does NOT limit it to 32-bit.
C99 standard does provide convenience types like int16_t & int32_t etc that map to correct bit sizes for a target platform.
on ftell/fseek
ftell() and fseek() are limited to 32 bits (including sign bit) on the vast majority of 32-bit architecture systems. So when there is large file support you run into this 2GB issue.
POSIX.1-2001 and SysV functions for fseek and ftell are fseeko and ftello because they use off_t as the parameter for the offset.
you do need to define compile with -D_FILE_OFFSET_BITS=64 or define it somewhere before including stdio.h to ensure that off_t is 64-bits.
Read about this at the cert.org secure coding guide.
On confusion about ftell and size of long int
C99 says long int must be at least 32-bits it does NOT say that it cannot be bigger
try the following on x86_64 architecture:
#include <stdio.h>
int main(int argc, char *argv[]) {
FILE *fp;
fp = fopen( "test.out", "w");
if ( !fp )
return -1;
fseek(fp, (1L << 34), SEEK_SET);
fprintf(fp, "\nhello world\n");
fclose(fp);
return 0;
}
Notice that 1L is just a long, this will produce a file that's 17GB and sticks a "\nhello world\n" to the end of it. Which you can verify is there by trivially using tail -n1 test.out or explicitly using:
dd if=test.out skip=$((1 << 25))
Note that dd typically uses block size of (1 << 9) so 34 - 9 = 25 will dump out '\nhello world\n'

At least on a 32bit OS ftell() it will overflow or error or simply run into Undefined Behaviour.
To get around this you might like to use off_t ftello(FILE *stream); and #define _FILE_OFFSET_BITS 64.
Verbatim from man ftello:
The fseeko() and ftello() functions are identical to fseek(3) and ftell(3) (see fseek(3)), respectively, except that the offset argument of fseeko() and the return value of ftello() is of type off_t instead of long.
On many architectures both off_t and long are 32-bit types, but compilation with
#define _FILE_OFFSET_BITS 64
will turn off_t into a 64-bit type.
Update:
According to IEEE Std 1003.1, 2013 Edition ftell() shall return -1 and set errno to EOVERFLOW in such cases:
EOVERFLOW
For ftell(), the current file offset cannot be represented correctly in an object of type long.

There is no 64b aware method in C99 standard. What OS/environment are you using? On windows, there is _ftelli64.
On other platforms, look at http://forums.codeguru.com/showthread.php?277234-Cannot-use-fopen()-open-file-larger-than-4-GB

This worked for me on Windows32/MinGW to play with a 6GB file
#define _FILE_OFFSET_BITS 64
#include<stdio.h>
int main() {
FILE *f = fopen("largefile.zip","rb");
fseeko64(f, 0, SEEK_END);
off64_t size = ftello64(f);
printf("%llu\n", size);
}
gcc readlargefile.c -c -std=C99 -o readlargefile.exe
Every single detail, the macro, the compiler option, matters.

printf with sizeof on 32 vs 64 platforms: how do I handle format code in platform independent manner?

I have some code that prints the amount of memory used by the program. The line is similar to this:
printf("The about of RAM used is %u", anIntVariable*sizeof(double) );
where anIntVariable is an int variable for the number of elements of the double array. Anyhow, on 32-bit systems I never had any problems but on 64-bit systems, I get a compiler warning about using "%u" for a unsigned long integer. Using "%lu" as the format code fixes the problem on 64-bit but causes the compiler to complain on 32-bit because the type is back to unsigned int. I've found that, indeed, sizeof(double) returns a different value on 32 vs 64 bit systems. I've found some webpage guides to convert code from 32 bit to 64 bit But I'd rather have code that works on both instead of just converting back and forth.
How do I write this line in a platform independent way? I know many ways I could do it using preprocessor directives but that seems like a hack. Surely there's an elegant way that I'm not realizing.

Portable printf identifiers are provided in the include file inttypes.h or here.
This include file has many portable identifiers for your specific runtime. For your example, you want PRIuPTR, which means "PRintf Identifier unsigned with size of up to a pointer's size".
Your example will then be:
printf("The amount of RAM used is %" PRIuPTR, anIntVariable*sizeof(double) );
Results on 64bit Linux with GCC 4.3 (int anIntVariable = 1):
$ gcc test.c -m32 -o test && ./test
The amount of RAM used is 8
$ gcc test.c -o test && ./test
The amount of RAM used is 8
For completeness sake, there are identifiers for scanf too, whose prefixes are SCN.

The return value of sizeof is a size_t. If you're using a C99 compliant compiler it looks like you can use %zd%zu for this.
D'oh: %zu (unsigned) of course. Thanks, ony.

First of all, you should match the "%" specifier with the actual data type you want to print. sizeof returns the data type size_t, and just as you shouldn't try to print a float using a "%d" specifier, you shouldn't try to print a size_t with "%u" or "%d" or anything that doesn't really mean size_t.
The other replies have given some good ways to handle this with newer compilers ("%z" and PRIu32), but the way we used to do this was simply to cast the size_t to unsigned long, and then print it using "%lu":
printf("The amount of RAM used is %lu", (unsigned long)(anIntVariable*sizeof(double)) );
This will not work on systems where size_t is wider than a long, but I don't know of any such systems, and I'm not even sure if the standard allows it.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Opening big files with fopen [duplicate] - c

On a 32-bit system, what does ftell return if the current position indicator of a file opened in binary mode is past the 2GB point? In the C99 standard, is this undefined behavior since ftell must return a long int (maximum value being 2**31-1)?

There is no 64b aware method in C99 standard. What OS/environment are you using? On windows, there is _ftelli64. On other platforms, look at http://forums.codeguru.com/showthread.php?277234-Cannot-use-fopen()-open-file-larger-than-4-GB

Related

what does `l` in `lseek` of unistd.h mean?

fseek to a 32-bit unsigned offset

C pread gives different results

ftell at a position past 2GB

printf with sizeof on 32 vs 64 platforms: how do I handle format code in platform independent manner?

Categories

Resources