Cross-platform seeking for large files [duplicate] - c

I am running into integer overflow using the standard ftell and fseek options inside of G++, but I guess I was mistaken because it seems that ftell64 and fseek64 are not available. I have been searching and many websites seem to reference using lseek with the off64_t datatype, but I have not found any examples referencing something equal to fseek. Right now the files that I am reading in are 16GB+ CSV files with the expectation of at least double that.
Without any external libraries what is the most straightforward method for achieving a similar structure as with the fseek/ftell pair? My application right now works using the standard GCC/G++ libraries for 4.x.

fseek64 is a C function. To make it available you'll have to define _FILE_OFFSET_BITS=64 before including the system headers. That will more or less define fseek to be actually fseek64. Or do it in the compiler arguments, e.g.
gcc -D_FILE_OFFSET_BITS=64 ....
http://www.suse.de/~aj/linux_lfs.html has a great overview of large file support on Linux:
Compile your programs with "gcc -D_FILE_OFFSET_BITS=64". This forces all file access calls to use the 64 bit variants. Several types change also, e.g. off_t becomes off64_t. It's therefore important to always use the correct types and to not use e.g. int instead of off_t. For portability with other platforms you should use getconf LFS_CFLAGS which will return -D_FILE_OFFSET_BITS=64 on Linux platforms but might return something else on e.g. Solaris. For linking, you should use the link flags that are reported via getconf LFS_LDFLAGS. On Linux systems, you do not need special link flags.
Define _LARGEFILE_SOURCE and _LARGEFILE64_SOURCE. With these defines you can use the LFS functions like open64 directly.
Use the O_LARGEFILE flag with open to operate on large files.

If you want to stick to ISO C standard interfaces, use fgetpos() and fsetpos(). However, these functions are only useful for saving a file position and going back to the same position later. They represent the position using the type fpos_t, which is not required to be an integer data type. For example, on a record-based system it could be a struct containing a record number and offset within the record. This may be too limiting.
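A minimal sketch of the save-and-restore use that fgetpos()/fsetpos() are suited for, using only ISO C calls. The helper name peek_byte is made up for this illustration; it is not part of any library.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one byte at the current position and then
 * puts the stream back where it was. Returns the byte, or EOF on failure
 * or at end of file. */
int peek_byte(FILE *f)
{
    fpos_t saved;               /* opaque: do not do arithmetic on it */
    int c;

    assert(f != NULL);
    if (fgetpos(f, &saved) != 0)
        return EOF;
    c = fgetc(f);
    if (fsetpos(f, &saved) != 0)
        return EOF;
    return c;
}
```

Note that nothing here computes an offset: the fpos_t value is only ever handed back to fsetpos(), which is exactly the limitation described above.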
POSIX defines the functions ftello() and fseeko(), which represent the position using the off_t type. This is required to be an integer type, and the value is a byte offset from the beginning of the file. You can perform arithmetic on it, and can use fseeko() to perform relative seeks. This will work on Linux and other POSIX systems.
In addition, compile with -D_FILE_OFFSET_BITS=64 (Linux/Solaris). This will define off_t to be a 64-bit type (i.e. off64_t) instead of long, and will redefine the functions that use file offsets to be the versions that take 64-bit offsets. This is the default when you are compiling for 64-bit, so is not needed in that case.
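As a rough sketch of the fseeko()/ftello() approach, assuming a POSIX system: the helper names stream_size and skip_records are invented for this example, and the macro is defined in the source here only as an alternative to passing -D_FILE_OFFSET_BITS=64 on the command line.

```c
#define _FILE_OFFSET_BITS 64    /* must come before every system header */
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: size of the stream in bytes, or -1 on error.
 * The original position is restored before returning. */
off_t stream_size(FILE *f)
{
    off_t here, size;

    assert(f != NULL);
    here = ftello(f);
    if (here == (off_t)-1)
        return -1;
    if (fseeko(f, 0, SEEK_END) != 0)
        return -1;
    size = ftello(f);
    if (fseeko(f, here, SEEK_SET) != 0)
        return -1;
    return size;
}

/* Hypothetical helper: relative seek over `count` fixed-size records.
 * off_t is an integer type, so ordinary arithmetic works on it. */
int skip_records(FILE *f, off_t count, off_t record_len)
{
    return fseeko(f, count * record_len, SEEK_CUR);
}
```

Keeping every offset in an off_t variable (never int or long) is what lets the same code handle files past 2GB once off_t is 64 bits wide.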

fseek64() isn't standard; the compiler docs should tell you where to find it.
Have you tried fgetpos and fsetpos? They're designed for large files and the implementation typically uses a 64-bit type as the base for fpos_t.

Have you tried fseeko() with the _FILE_OFFSET_BITS preprocessor symbol set to 64?
This will give you an fseek()-like interface but with an offset parameter of type off_t instead of long. Setting _FILE_OFFSET_BITS=64 will make off_t a 64-bit type.
The same goes for ftello().

Use fsetpos(3) and fgetpos(3). They use the fpos_t datatype, which I believe is guaranteed to be able to hold at least 64 bits.

Related

What is the purpose of libc_nonshared.a?

Why does libc_nonshared.a exist? What purpose does it serve? I haven't been able to find a good answer for its existence online.
As far as I can tell it provides certain symbols (stat, lstat, fstat, atexit, etc.). If someone uses one of these functions in their code, it will get linked into the final executable from this archive. These functions are part of the POSIX standard and are pretty common so I don't see why they wouldn't just be put in the shared or static libc.so.6 or libc.a, respectively.
It was a legacy mistake in glibc's implementing extensibility for the definition of struct stat before better mechanisms (symbol redirection or versioning) were thought of. The definitions of the stat-family functions in libc_nonshared.a cause the version of the structure to bind at link-time, and the definitions there call the __xstat-family functions in the real shared libc, which take an extra argument indicating the desired structure version. This implementation is non-conforming to the standard since each shared library ends up getting its own copy of the stat-family functions with their own addresses, breaking the requirement that pointers to the same function evaluate equal.
Here's the problem. Long ago, members of the struct stat structure had different sizes than they have today. In particular:
uid_t was 2 bytes (though I think this one was fixed in the transition from libc5 to glibc)
gid_t was 2 bytes
off_t was 4 bytes
blkcnt_t was 4 bytes
time_t was 4 bytes
also, timespec wasn't used at all and there was no room for nanosecond precision.
So all of these had to change. The only real solution was to make different versions of the stat() system call and library function and you get the version you compiled against. That is, the .a file matches the header files. These things didn't all change at once, but I think we're done changing them now.
You can't really solve this by a macro because the structure name is the same as the function name; and inline wasn't mandated to exist in the beginning so glibc couldn't demand everybody use it.
I remember there used to be this thing O_LARGEFILE for saying you could handle files bigger than 4GB; otherwise things just wouldn't work. We also used to have to define things like _LARGEFILE_SOURCE and _LARGEFILE64_SOURCE but it's all handled automatically now. Back in the day, if you weren't ready for large file support yet, you didn't define these and you didn't get the 64-bit version of the stat structure; your program then also worked on older kernel versions lacking the new system calls. I haven't checked; it's possible that 32-bit compilation still doesn't define these automatically, but 64-bit always does.
So you probably think; okay, fine, just don't franken-compile stuff? Just build everything that goes into the final executable with the same glibc version and largefile-choice. Ever use plugins such as browser plugins? Those are pretty much guaranteed to be compiled in different places with different compiler and glibc versions and options; and this didn't require you to upgrade your browser and replace all its plugins at the same time.

Does _FILE_OFFSET_BITS work after prior inclusion of stdio.h?

I want to use fseeko, and so have this:
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
Because source is intended for all sorts of platforms and architectures there are many surrounding preprocessor switches for each, making it unpreferable to copy into source files where used. However, if putting this in a header I fear this would require that header to be included first, which may be difficult to enforce.
Testing whether an stdio.h macro is defined or not might work, e.g. SEEK_CUR, but that is ugly in its own ways since there is no standard, self-explanatory STDIO_INCLUDED-macro to test for.
Some headers allow multiple inclusion with different switches, like assert.h and NDEBUG. I was wondering if this is also the case according to POSIX standard for _FILE_OFFSET_BITS and other similar switches.
Ugh. Why do my reading skills get a boost right after asking a question in public? From the fseeko page, emphasis mine:
On some architectures, both off_t and long are 32-bit types, but defining _FILE_OFFSET_BITS with the value 64 (before including any header files) will turn off_t into a 64-bit type.
The answer is no.
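Since the include order cannot be fixed up afterwards, one option is to make a wrong order fail loudly. The following is a sketch only, assuming a C11 compiler for _Static_assert; the idea of a shared project header and the helper off_t_bits are assumptions made for this example.

```c
/* Either pass -D_FILE_OFFSET_BITS=64 on the command line, or make this
 * the very first thing in the translation unit. */
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64
#endif

#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>

/* What a hypothetical shared header could contain: the build stops if
 * off_t ended up narrower than 64 bits, for instance because a system
 * header was included before the macro was defined. */
_Static_assert(sizeof(off_t) * CHAR_BIT >= 64,
               "off_t is not 64-bit: define _FILE_OFFSET_BITS=64 before any system header");

/* Hypothetical helper: width of off_t in bits, e.g. for a startup log. */
int off_t_bits(void)
{
    return (int)(sizeof(off_t) * CHAR_BIT);
}
```

Passing the macro with -D in the build system avoids the ordering problem entirely, which is probably simpler than enforcing a first-include rule.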

Can I seek a position beyond 2GB in C using the standard library?

I am making a program that reads disk images in C. I am trying to make something portable, so I do not want to use too many OS-specific libraries. I am aware there are many disk images that are very large files but I am unsure how to support these files.
I have read up on fseek and it seems to use a long int, which is not guaranteed to support values over 2^31-1. fsetpos seems to support a larger value with fpos_t, but an absolute position cannot be specified. I have also thought about using several relative seeks with fseek but am unsure if this is portable.
How can I portably support large files in C?
There is no portable way.
On Linux there is the fseeko()/ftello() pair (it needs some defines; check ftello()).
On Windows, I believe you have to use _fseeki64() and _ftelli64()
#ifdef is your friend
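A sketch of what such an #ifdef wrapper might look like. The names x_fseek64 and x_ftell64 are invented for this example; _fseeki64()/_ftelli64() are the Microsoft CRT calls and fseeko()/ftello() the POSIX ones mentioned above.

```c
#if !defined(_WIN32)
#define _FILE_OFFSET_BITS 64    /* make off_t 64-bit on 32-bit POSIX targets */
#endif

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#if defined(_WIN32)

/* Microsoft C runtime */
int x_fseek64(FILE *f, int64_t offset, int whence)
{
    return _fseeki64(f, offset, whence);
}

int64_t x_ftell64(FILE *f)
{
    return _ftelli64(f);
}

#else

#include <sys/types.h>

/* POSIX */
int x_fseek64(FILE *f, int64_t offset, int whence)
{
    return fseeko(f, (off_t)offset, whence);
}

int64_t x_ftell64(FILE *f)
{
    return (int64_t)ftello(f);
}

#endif
```

The rest of the program then only ever sees int64_t offsets and the two wrapper functions, so the platform split stays in one place.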
pread() works on any POSIX-compliant platform (OS X, Linux, BSD, etc.). It's missing on Windows but there are lots of standard things that Windows gets wrong; this won't be the only thing in your codebase that needs a Windows special case.
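For illustration, a small wrapper around pread() that reads at an absolute offset without touching the descriptor's current position. The name read_at is hypothetical, and the feature test macros at the top are an assumption about how the file is compiled.

```c
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reads up to `len` bytes starting at `offset`.
 * Returns the number of bytes read (short only at end of file),
 * or -1 on error. */
ssize_t read_at(int fd, void *buf, size_t len, off_t offset)
{
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = pread(fd, (char *)buf + done, len - done,
                          offset + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            break;              /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Because the offset is passed on every call, there is no separate seek step to get wrong, and several threads can read from the same descriptor independently.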
You can't do it with standard C. Even with relative seeks it's not possible on some architectures.
One approach would be to check the platform at compile time. You can just check the value of LONG_MAX and throw a compile error if it's not large enough. But even that doesn't guarantee that the underlying filesystem supports files larger than 2 or 4GB.
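The LONG_MAX check could look roughly like this. The macro FSEEK_OFFSET_IS_64BIT and the function plain_fseek_is_enough are made-up names; the sketch records the result in a macro and shows in a comment where a strict build would use #error instead.

```c
#include <assert.h>
#include <limits.h>

/* The preprocessor evaluates this in the widest integer type, so the
 * comparison is safe even where long itself is only 32 bits. */
#if LONG_MAX >= 0x7FFFFFFFFFFFFFFF
#define FSEEK_OFFSET_IS_64BIT 1
#else
#define FSEEK_OFFSET_IS_64BIT 0
/* A strict build could stop right here instead:
 *   #error "long is too narrow for fseek()/ftell() on large files"
 */
#endif

/* Hypothetical helper: nonzero if plain fseek()/ftell() can address
 * offsets beyond 2GB on this platform. */
int plain_fseek_is_enough(void)
{
    return FSEEK_OFFSET_IS_64BIT;
}
```

On typical 64-bit Linux this evaluates to 1, while on 64-bit Windows long stays 32 bits and it evaluates to 0, which is why the check has to look at LONG_MAX rather than at the pointer size.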
A better way is to use the pre-processor macros supplied by your compiler to check the operating system that your code is being compiled for and write operating-system-specific code. The operating system should provide a way to check that the filesystem actually supports files larger than 2GB or 4GB.

What is the difference between _LARGEFILE_SOURCE and _FILE_OFFSET_BITS=64?

I understand that -D_FILE_OFFSET_BITS=64 causes off_t to be 64bits. So what does -D_LARGEFILE_SOURCE do that isn't already done by -D_FILE_OFFSET_BITS=64? What do these definitions do exactly?
The GLIBC Feature test macros documentation states:
_LARGEFILE_SOURCE
If this macro is defined some extra functions are available which rectify a few shortcomings in all previous standards. Specifically, the functions fseeko and ftello are available. Without these functions the difference between the ISO C interface (fseek, ftell) and the low-level POSIX interface (lseek) would lead to problems.
This macro was introduced as part of the Large File Support extension (LFS).
So that macro specifically makes fseeko and ftello available. _FILE_OFFSET_BITS settings alone don't make these functions available.
(Note that if you're using a GNU dialect of C, the default with GCC, you might not need to explicitly define _LARGEFILE_SOURCE. You do if you use -std=c99 for instance.)
The other answer is wrong, as the documentation for _LARGEFILE_SOURCE is misleading. _FILE_OFFSET_BITS=64 is sufficient to expose the fseeko and ftello functions, and so is a _POSIX_C_SOURCE macro defined to >= 200112L.
From the glibc documentation on _FILE_OFFSET_BITS
If the macro is defined to the value 64, the large file interface replaces the old interface. I.e., the functions are not made available under different names (as they are with _LARGEFILE64_SOURCE). Instead the old function names now reference the new functions, e.g., a call to fseeko now indeed calls fseeko64.
Always define _FILE_OFFSET_BITS=64 to switch over to the 64-bit types on 32-bit glibc-based systems. glibc should really make it the default...
LFS
Large File Support (LFS) provides extra functionality required to access large files.
large file == on 32-bit architectures, a file larger than 2GB
We can write applications requiring LFS functionality in one of two ways:
Define the _LARGEFILE64_SOURCE feature test macro when compiling our program, i.e. use the transitional LFS API.
Define the _FILE_OFFSET_BITS macro with the value 64 when compiling our programs.
_LARGEFILE64_SOURCE
Define the _LARGEFILE64_SOURCE feature test macro when compiling our program, i.e. use the transitional LFS API.
This API provides functions capable of handling 64-bit file sizes and offsets.
These functions have the same names as their 32-bit counterparts, but have the suffix 64 appended to the function name.
Among these functions are
fopen64(),
open64(),
lseek64(),
truncate64(),
stat64(),
mmap64(),
setrlimit64().
For example:
fd = open64(name, O_CREAT | O_RDWR, mode);
_FILE_OFFSET_BITS
Define the _FILE_OFFSET_BITS macro with the value 64 when compiling our programs.
This automatically converts all of the relevant 32-bit functions and data types into their 64-bit counterparts.
For example, calls to open() are actually converted into calls to open64(), and the off_t data type is defined to be 64 bits long.
In other words, we can recompile an existing program to handle large files without needing to make any changes to the source code.
vs
Using _FILE_OFFSET_BITS is clearly simpler than using the _LARGEFILE64_SOURCE (transitional LFS API).
_LARGEFILE64_SOURCE (transitional LFS API) is now obsolete.
_FILE_OFFSET_BITS is preferred.
reference
The Linux Programming Interface
(most content was directly copied from here)

Binary compatibility of FILE*

I am designing C library which does some mathematical calculations. I need to specify serialization interface to be able to save and then load some data. The question is, is it correct (from binary compatibility point of view) to use FILE* pointer in the public API of library?
Target platforms are:
Linux x86, x86_64 with gcc >= 3.4.6
Windows x86, x86_64 >= WinXP with VS >= 2008sp1
I need to be as binary compatible as possible, so at the moment my variant is the following:
void SMModuleSave(SMModule* module, FILE* dest);
SMModule* SMModuleLoad(FILE* src);
So I am curious if it is correct to use FILE* or better switch to wchar*/char* ?
I don't agree with ThiefMaster: there's no benefit in going native (ie using file descriptors of type int on linux and handles of type void * on windows) when there's an equivalent portable solution.
I'd probably go with FILE * instead of opening the files by name from within the library: It might be more of a hassle for library users, but it's also more flexible as most libc implementations provide various ways for file opening (fopen(), _wfopen(), _fdopen(), fdopen(), fmemopen(),...) and you don't have to maintain separate wide-char APIs yourself.
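A sketch of how the FILE * variant might look from both sides. The function signatures are the ones from the question; the contents of SMModule and the raw-struct file format are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical contents; the question does not show the real struct. */
typedef struct SMModule {
    unsigned int count;
    double values[4];
} SMModule;

/* Same signature as in the question. The library never opens or closes
 * anything; it just writes to whatever stream the caller provides. */
void SMModuleSave(SMModule *module, FILE *dest)
{
    assert(module != NULL && dest != NULL);
    /* Raw struct dump: fine for a sketch, not a portable file format. */
    fwrite(module, sizeof *module, 1, dest);
}

/* Same signature as in the question. Returns NULL on failure; the
 * caller releases the result with free(). */
SMModule *SMModuleLoad(FILE *src)
{
    SMModule *m;

    assert(src != NULL);
    m = malloc(sizeof *m);
    if (m != NULL && fread(m, sizeof *m, 1, src) != 1) {
        free(m);
        m = NULL;
    }
    return m;
}
```

The caller decides where the stream comes from, for example fopen() with a path, tmpfile(), or fdopen() on an existing descriptor, and the library code stays the same in each case.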
I'd use neither but let the user pass a file descriptor as an int.
Then you can fdopen() it in your code to get a FILE*.
However, when using windows it might not be the best solution even though it does have some helper functions to get a numeric file descriptor.
However, passing a FILE* or a const char* should be fine, too. I'd prefer passing a filename as it's less code to write if a library takes care of opening/closing a file.
Yes it is correct, from a stable binary interface perspective, to use FILE * here. I think perhaps you're confusing this with using FILE as opposed to a pointer to it. Notice that your standard library's fopen, fgets, etc. functions all use (both as arguments and return values) the FILE * type as part of their public interfaces.
A FILE * is a standard ANSI/ISO C89 and C99 (even K&R) type. It is a portability dream and I'd prefer it over anything else. You're safe to go with it. It won't get any better than that.
