I am reading a file format (TIFF) that has 32-bit unsigned offsets from the beginning of the file.
Unfortunately the prototype for fseek, the usual way I would go to particular file offset, is:
int fseek ( FILE * stream, long int offset, int origin );
so the offset is signed. How should I handle this situation? Should I be using a different function for seeking?
After studying this question more deeply and considering the other comments and answers (thank you), I think the simplest approach is to do two seeks if the offset is greater than 2147483647 bytes. This allows me to keep the offsets as uint32_t and continue using fseek. The positioning code is therefore like this:
// note: error handling code omitted
uint32_t offset = ... (whatever it is)
if( offset > 2147483647 ){
fseek( file, 2147483647, SEEK_SET );
fseek( file, (long int)( offset - 2147483647 ), SEEK_CUR );
} else {
fseek( file, (long int) offset, SEEK_SET );
}
The problem with using 64-bit types is that the code might be running on a 32-bit architecture (among other things). There is a function fsetpos which uses a structure fpos_t to manage arbitrarily large offsets, but that brings with it a range of complexities. fsetpos might make sense if I were truly using offsets of arbitrary size, but since I know every possible offset fits in a uint32_t, the double seek meets that need.
Note that this solution allows all TIFF files to be handled on a 32-bit system. The advantage of this is obvious if you consider commercial programs like PixInsight. PixInsight can only handle TIFF files smaller than 2147483648 bytes when running on 32-bit systems. To handle full sized TIFF files, a user has to use the 64-bit version of PixInsight on a 64-bit computer. This is probably because the PixInsight programmers used a 64-bit type to handle the offsets internally. Since my solution only uses 32-bit types, I can handle full-sized TIFF files on a 32-bit system (as long as the underlying operating system can handle files that large).
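The two-step seek above, with the omitted error handling filled in, might look like the sketch below. The helper name seek_u32 is hypothetical, and the approach still assumes the C library lets the stream position move past 2147483647 through a relative seek, which is what the answer relies on.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: seek to an absolute 32-bit unsigned offset using
 * only fseek(). Returns 0 on success, -1 if either seek fails. */
static int seek_u32(FILE *file, uint32_t offset)
{
    const uint32_t half = 2147483647u;   /* largest value a 32-bit long holds */

    if (offset > half) {
        /* First jump as far as a signed 32-bit offset allows ... */
        if (fseek(file, (long)half, SEEK_SET) != 0)
            return -1;
        /* ... then cover the remainder relative to that position. */
        if (fseek(file, (long)(offset - half), SEEK_CUR) != 0)
            return -1;
        return 0;
    }
    return fseek(file, (long)offset, SEEK_SET) != 0 ? -1 : 0;
}
```

A caller would check the result, for example `if (seek_u32(file, ifd_offset) != 0) { /* report the error */ }`, where ifd_offset stands for whatever offset was read from the TIFF header.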
You can try to use lseek64() (man page)
#define _LARGEFILE64_SOURCE /* See feature_test_macros(7) */
#include <sys/types.h>
#include <unistd.h>
off64_t lseek64(int fd, off64_t offset, int whence);
With
int fd = fileno (stream);
Notes from The GNU C lib - Setting the File Position of a Descriptor
This function is similar to the lseek function. The difference is that the offset parameter is of type off64_t instead of off_t which makes it possible on 32 bit machines to address files larger than 2^31 bytes and up to 2^63 bytes. The file descriptor filedes must be opened using open64 since otherwise the large offsets possible with off64_t will lead to errors with a descriptor in small file mode.
When the source file is compiled with _FILE_OFFSET_BITS == 64 on a 32 bits machine this function is actually available under the name lseek and so transparently replaces the 32 bit interface.
About fd and stream, from Streams and File Descriptors
Since streams are implemented in terms of file descriptors, you can extract the file descriptor from a stream and perform low-level operations directly on the file descriptor. You can also initially open a connection as a file descriptor and then make a stream associated with that file descriptor.
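As a rough sketch of the descriptor-based route, the helper below opens the file itself and uses plain lseek with _FILE_OFFSET_BITS set to 64, as the quoted glibc notes describe. The name read_at_u32 is made up for illustration. Opening a separate descriptor avoids mixing unbuffered reads with a buffered stream, which is an extra precaution not discussed in the answer.

```c
#define _FILE_OFFSET_BITS 64   /* must come before any #include */
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes starting at a 32-bit unsigned
 * offset. Returns the number of bytes read, or -1 on error. */
static ssize_t read_at_u32(const char *path, uint32_t offset,
                           void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = -1;
    /* With a 64-bit off_t every uint32_t value is representable. */
    if (lseek(fd, (off_t)offset, SEEK_SET) != (off_t)-1)
        n = read(fd, buf, len);

    close(fd);
    return n;
}
```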
Related
I am reading APUE to explore the details of C and Unix, and encounter lseek
NAME
lseek - move the read/write file offset
SYNOPSIS
#include <unistd.h>
off_t lseek(int fildes, off_t offset, int whence);
What does l mean, is it length?
l is for long integer.
It is named like that to differentiate it from the old seek() in Version 2 of AT&T Unix. The name is an anachronism from before the off_t type was introduced.
References:
Infohost indicates:
The character l in the name lseek means "long integer". Before the
introduction of the off_t data type, the offset argument and the
return value were long integers. lseek was introduced with Version 7
when long integers were added to C. (Similar functionality was
provided in Version 6 by the functions seek and tell.)
As noted at the foot of lseek.html:
A seek() function appeared in Version 2 AT&T UNIX, later renamed into
lseek() for ``long seek'' due to a larger offset argument type.
Note: Paraphrased from Why is the function called lseek(), not seek()?
On a 32-bit system, what does ftell return if the current position indicator of a file opened in binary mode is past the 2GB point? In the C99 standard, is this undefined behavior since ftell must return a long int (maximum value being 2**31-1)?
on long int
long int is supposed to be AT LEAST 32-bits, but C99 standard does NOT limit it to 32-bit.
C99 standard does provide convenience types like int16_t & int32_t etc that map to correct bit sizes for a target platform.
on ftell/fseek
ftell() and fseek() are limited to 32 bits (including the sign bit) on the vast majority of 32-bit systems, so even when the system itself has large file support you run into this 2GB limit.
The POSIX.1-2001 and SysV counterparts of fseek and ftell are fseeko and ftello, which use off_t for the offset.
You need to compile with -D_FILE_OFFSET_BITS=64, or define that macro somewhere before including stdio.h, to ensure that off_t is 64 bits.
Read about this at the cert.org secure coding guide.
On confusion about ftell and size of long int
C99 says long int must be at least 32 bits; it does NOT say that it cannot be bigger.
Try the following on an x86_64 architecture:
#include <stdio.h>
int main(int argc, char *argv[]) {
FILE *fp;
fp = fopen( "test.out", "w");
if ( !fp )
return -1;
fseek(fp, (1L << 34), SEEK_SET);
fprintf(fp, "\nhello world\n");
fclose(fp);
return 0;
}
Notice that 1L is just a long. This produces a file of about 17GB with "\nhello world\n" at the end of it, which you can verify trivially with tail -n1 test.out or explicitly with:
dd if=test.out skip=$((1 << 25))
Note that dd typically uses a block size of (1 << 9) bytes, so skipping (1 << 25) blocks (34 - 9 = 25) lands at byte offset (1 << 34) and dumps out '\nhello world\n'.
At least on a 32-bit OS, ftell() will overflow, return an error, or simply run into undefined behaviour.
To get around this you might like to use off_t ftello(FILE *stream); and #define _FILE_OFFSET_BITS 64.
Verbatim from man ftello:
The fseeko() and ftello() functions are identical to fseek(3) and ftell(3) (see fseek(3)), respectively, except that the offset argument of fseeko() and the return value of ftello() is of type off_t instead of long.
On many architectures both off_t and long are 32-bit types, but compilation with
#define _FILE_OFFSET_BITS 64
will turn off_t into a 64-bit type.
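A minimal sketch of how fseeko() and ftello() could be used with a 32-bit unsigned offset follows. The helper names seek_u32_fseeko and tell64 are hypothetical, and the _POSIX_C_SOURCE definition is an assumption added so that the declarations are visible under strict -std= modes.

```c
#define _FILE_OFFSET_BITS 64      /* make off_t 64 bits; before any #include */
#define _POSIX_C_SOURCE 200809L   /* assumption: expose fseeko()/ftello() */
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: seek a stream to an absolute 32-bit unsigned offset.
 * Returns 0 on success, -1 on error (as fseeko does). */
static int seek_u32_fseeko(FILE *stream, uint32_t offset)
{
    return fseeko(stream, (off_t)offset, SEEK_SET);
}

/* Hypothetical helper: current position as a 64-bit value, or -1 on error. */
static int64_t tell64(FILE *stream)
{
    return (int64_t)ftello(stream);
}
```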
Update:
According to IEEE Std 1003.1, 2013 Edition ftell() shall return -1 and set errno to EOVERFLOW in such cases:
EOVERFLOW
For ftell(), the current file offset cannot be represented correctly in an object of type long.
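If plain ftell() has to stay, the failure case quoted above can at least be detected. This is only a sketch: checked_ftell is a made-up name, and EOVERFLOW is a POSIX error code, so other platforms may report the failure differently.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: store the current position in *pos and return 0,
 * or return -1 if ftell() fails. On POSIX systems errno is EOVERFLOW when
 * the position does not fit in a long. */
static int checked_ftell(FILE *stream, long *pos)
{
    errno = 0;
    long p = ftell(stream);
    if (p == -1L)
        return -1;   /* caller can inspect errno for the reason */
    *pos = p;
    return 0;
}
```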
There is no 64-bit-aware function for this in the C99 standard. What OS/environment are you using? On Windows, there is _ftelli64.
On other platforms, look at http://forums.codeguru.com/showthread.php?277234-Cannot-use-fopen()-open-file-larger-than-4-GB
This worked for me on Windows32/MinGW to play with a 6GB file
#define _FILE_OFFSET_BITS 64
#include<stdio.h>
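int main(void) {
    FILE *f = fopen("largefile.zip","rb");
    if (!f)
        return 1;
    fseeko64(f, 0, SEEK_END);
    off64_t size = ftello64(f);
    /* off64_t is signed, so print it as a long long */
    printf("%lld\n", (long long)size);
    fclose(f);
    return 0;
}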
gcc readlargefile.c -std=c99 -o readlargefile.exe
Every single detail, the macro, the compiler option, matters.
Basically I have a file, and in this file I am writing 3 bytes, and then I'm writing a 4 byte integer. In another application I read the first 3 bytes, and then I read the next 4 bytes and convert them to an integer.
When I print out the value, I have very different results...
fwrite(&recordNum, 2, 1, file); //The first 2 bytes (recordNum is a short int)
fwrite(&charval, 1, 1, file); //charval is a single byte char
fwrite(&time, 4, 1, file);
// I continue writing a total of 40 bytes
Here is how time was calculated:
time_t rawtime;
struct tm * timeinfo;
time(&rawtime);
timeinfo = localtime(&rawtime);
int time = (int)rawtime;
I have tested to see that sizeof(time) is 4 bytes, and it is. I have also tested using an epoch converter to make sure this is the correct time (in seconds) and it is.
Now, in another file I read the 40 bytes to a char buffer:
char record[40];
fread(record, 1, 40, file);
// Then I convert those 4 bytes into an uint32_t
uint32_t timestamp =(uint32_t)record[6] | (uint32_t)record[5] << 8 | (uint32_t)record[4] << 16 | (uint32_t)record[3] << 24;
printf("Testing timestamp = %d\n", timestamp);
But this prints out -6624. The expected value is 551995007.
EDIT
To be clear, everything else that I am reading from the char buffer is correct. After this timestamp I have text, which I simply print and it runs fine.
You write the time at once with fwrite, which uses the native byte ordering, then you explicitly read the individual bytes in big-endian format (most significant byte first). Your machine is likely using little-endian format for byte ordering, which would explain the difference.
You need to read/write in a consistent manner. The simplest way to do this is to fread one variable at a time, just like you're writing:
fread(&recordNum, sizeof(recordNum), 1, file);
fread(&charval, sizeof(charval), 1, file);
fread(&time, sizeof(time), 1, file);
Also note the use of sizeof to calculate the size.
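A small round-trip sketch of that idea: the writer and the reader use the same types, the same sizes, and the same order. The struct and the function names are hypothetical, and the field types are assumptions chosen to match the 2 + 1 + 4 byte layout in the question.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record header matching the question: 2 + 1 + 4 bytes. */
struct record_head {
    uint16_t record_num;
    uint8_t  charval;
    int32_t  timestamp;
};

/* Write the fields one at a time in native byte order. Returns 0 or -1. */
static int write_head(FILE *f, const struct record_head *h)
{
    return (fwrite(&h->record_num, sizeof h->record_num, 1, f) == 1
         && fwrite(&h->charval,    sizeof h->charval,    1, f) == 1
         && fwrite(&h->timestamp,  sizeof h->timestamp,  1, f) == 1) ? 0 : -1;
}

/* Read the fields back in exactly the same order and sizes. Returns 0 or -1. */
static int read_head(FILE *f, struct record_head *h)
{
    return (fread(&h->record_num, sizeof h->record_num, 1, f) == 1
         && fread(&h->charval,    sizeof h->charval,    1, f) == 1
         && fread(&h->timestamp,  sizeof h->timestamp,  1, f) == 1) ? 0 : -1;
}
```

Writing field by field rather than writing the whole struct keeps compiler padding out of the file, so the on-disk layout stays at 7 bytes.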
Your problem is probably right here:
uint32_t timestamp =(uint32_t)record[6] | (uint32_t)record[5] << 8 | (uint32_t)record[4] << 16 | (uint32_t)record[3] << 24;
printf("Testing timestamp = %d\n", timestamp);
You've used fwrite to write out a 32 bit integer.. in whatever order the processor stored it in memory.. and you don't actually know what byte ordering (endian-ness) the machine used. Maybe the first byte written out is the lowest byte of the integer, or maybe it's the highest byte of the integer.
If you're reading and writing the data on the same machine, or on different machines with the same architecture, you don't need to care about that.. it will work. But if the data is written on an architecture with one byte ordering, and potentially read in on an architecture with another byte ordering, it will be wrong: Your code needs to know what order the bytes should be in memory and what order they will be read/written on disk.
In this case, in your code, you are doing a mix of both: You write them out in whatever endian-ness the machine uses natively.. then when you read them in, you start shifting the bits around as if you know what order they were originally in.. but you don't, because you didn't pay attention to the order when you wrote them out.
So if you're writing and reading the file on the same machine, or identical machine (same processor, OS, compiler, etc), just write them out in the native order (without worrying about what that is) and then read them back in exactly as you wrote them out. If you write them and read them on the same machine, it'll work.
So if your timestamp is located at offset 3 through 6 of your record, just do this:
uint32_t timestamp;
memcpy(&timestamp, record + 3, sizeof(timestamp));
Note that you cannot directly cast record+3 to a uint32_t pointer because it might violate the system's word alignment requirements.
Note also that you should probably be using time_t type to hold the timestamp, if you're on a unix-like system, that'll be the natural type supplied to hold epoch time values.
But if you are planning to move this file to another machine at any point and try to read it there, you could easily end up with your data on a system that has different endian-ness or different size for time_t. Simply writing bytes in and out of a file with no thought to the endian-ness or size of types on different operating systems is just fine for temporary files or for files which are meant to be used on one computer only and which will never be moved to other types of system.
Making data files that are portable between systems is a whole subject in itself. But the first thing you should do, if you care about that, is to look at the functions htons(), ntohs(), htonl(), ntohl(), and their ilk.. which convert to and from the system native endian-ness to a known (big) endian-ness which is the standard for internet communications and generally used for interoperability (even though Intel processors are little-endian and dominate the market these days). These functions do something similar to what you were doing with your bit-shifting but since someone else wrote them, you don't have to. It's a lot easier to use the library functions for this!
For example:
#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>
int main(void) {
    uint32_t x = 1234, z;
    // open a file for writing, convert x from native to big endian, write it.
    FILE *file = fopen("foo.txt", "wb");
    z = htonl(x);
    fwrite(&z, sizeof(z), 1, file);
    fclose(file);
    // reopen the file, read the big-endian value, convert it back to native.
    file = fopen("foo.txt", "rb");
    fread(&z, sizeof(z), 1, file);
    x = ntohl(z);
    fclose(file);
    printf("%u\n", (unsigned)x);
    return 0;
}
NOTE I am NOT CHECKING FOR ERRORS in this code, it is just an example.. do not use functions like fopen, fread etc without checking for errors.
By using these functions both when writing the data out to disk and when reading it back, you guarantee that the data on disk is always big-endian.. e.g. htonl() does nothing on a big-endian platform, and on a little-endian platform it converts from little-endian to big-endian. And ntohl() does the opposite. So your data on disk will always be read in correctly.
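Where arpa/inet.h is not available, the same fixed byte order can be produced by hand. The sketch below uses hypothetical helpers put_be32 and get_be32 that work on unsigned char. This matters for the shifting code in the question: a plain char may be signed, and a negative byte is sign-extended when cast to uint32_t, which is consistent with the -6624 the asker saw.

```c
#include <stdint.h>

/* Hypothetical helper: store v in out[0..3], most significant byte first. */
static void put_be32(unsigned char *out, uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Hypothetical helper: rebuild the value from four big-endian bytes.
 * The bytes are unsigned, so no sign extension can occur. */
static uint32_t get_be32(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

With these, the writer would call put_be32 on a 4-byte buffer and fwrite that buffer, and the reader would call get_be32 on the bytes at offset 3 of the record, independent of either machine's native byte order.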
I am trying to read double values from a binary in C, but the binary starts with an integer and then the doubles I am looking for.
How do I skip that first 4 bytes when reading with fread()?
Thanks
Try this:
fseek(input, sizeof(int), SEEK_SET);
before any calls to fread.
As Weather Vane said you can use sizeof(int) safely if the file was generated in the same system architecture as the program you are writing. Otherwise, you should manually specify the size of integer of the system where the file originated.
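One way to pin the header size down is to spell it as a fixed-width type instead of int. This sketch assumes the file starts with a 4-byte integer, as in the question, and that the doubles were written in the reader's native format. The function name is hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: skip a 4-byte integer header, then read up to
 * `count` doubles into `out`. Returns the number of doubles actually read. */
static size_t read_doubles_after_header(FILE *input, double *out, size_t count)
{
    /* sizeof(int32_t) is 4 on every platform that provides the type. */
    if (fseek(input, (long)sizeof(int32_t), SEEK_SET) != 0)
        return 0;
    return fread(out, sizeof(double), count, input);
}
```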
You can use fseek to skip the initial integer. If you insist on using fread anyway, then you can read the integer first:
fread(ptr, sizeof(int), 1, stream).
Of course you have to declare ptr before calling fread.
As I said, fseek is another option:
fseek(stream, sizeof(int), SEEK_SET).
Beware that fseek moves the file position in bytes (here, sizeof(int) bytes from the beginning of the file); an int can be 4 bytes or some other size, which is system specific.
Be careful when implementing things like this. If the file isn't created on the same machine, you may get invalid values due to different floating point specifications.
If the file you're reading is created on the same machine, make sure that the program that writes it uses the type sizes you expect.
If both writer and reader are developed in C and are supposed to run only on the same machine, call fseek() with sizeof(type) as the offset, where type is the type the writer used.
If the machine that writes the binary isn't the same that will read it, you probably don't want to even read the doubles with fread() as their format may differ due to possible different architectures.
Many architectures rely on the IEEE 754 for floating point format, but if the application is supposed to address multi-platform support, you should make sure that the serialized format can be read from all architectures (or converted while unserializing).
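One possible serialized format is sketched below: each double is stored as its 64-bit pattern, most significant byte first. This assumes both machines use IEEE 754 binary64 for double and that integers and doubles share a byte order on each host, which holds on mainstream platforms but is not guaranteed by the C standard. The helper names are made up.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode a double as 8 big-endian bytes. */
static void put_double_be(unsigned char out[8], double value)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);          /* reinterpret without aliasing issues */
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(bits >> (56 - 8 * i));
}

/* Hypothetical helper: decode 8 big-endian bytes back into a double. */
static double get_double_be(const unsigned char in[8])
{
    uint64_t bits = 0;
    for (int i = 0; i < 8; i++)
        bits = (bits << 8) | in[i];
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```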
Just read those 4 unneeded bytes, like
void* buffer = malloc(sizeof(double));
fread(buffer,4,1,input); //to skip those four bytes
fread(buffer,sizeof(double),1,input); //then read first double =)
double* data = (double*)buffer;//then convert it to double
And so on