I have to analyze a 16 GB file. I am reading through the file sequentially using fread() and fseek(). Is it feasible? Will fread() work for such a large file?
You don't mention a language, so I'm going to assume C.
I don't see any problems with fread, but fseek and ftell may have issues.
Those functions use long int as the data type to hold the file position, rather than something intelligent like fpos_t or even size_t. This means that they can fail to work on a file over 2 GB, and can certainly fail on a 16 GB file.
You need to see how big long int is on your platform. If it's 64 bits, you're fine. If it's 32, you are likely to have problems when using ftell to measure distance from the start of the file.
Consider using fgetpos and fsetpos instead.
Thanks for the response. I figured out where I was going wrong: fseek() and ftell() don't work for large files on Windows, where long is only 32 bits. I used _fseeki64() and _ftelli64() and it is working fine now.
If implemented correctly this shouldn't be a problem. I assume by sequentially you mean you're looking at the file in discrete chunks and advancing your file pointer.
Check out http://www.computing.net/answers/programming/using-fread-with-a-large-file-/10254.html
It sounds like he was doing nearly the same thing as you.
It depends on what you want to do. If you want to read the whole 16GB of data in memory, then chances are that you'll run out of memory or application heap space.
Rather read the data chunk by chunk and do processing on those chunks (and free resources when done).
But, besides all this, decide which approach you want to take (fread(), istream, etc.) and run some test cases to see which works better for you.
If you're on a POSIX-ish system, you'll need to make sure your program is built with 64-bit file offset support. POSIX allows (and most systems enforce) an implementation to refuse I/O operations on files whose size doesn't fit in off_t, even if the only I/O performed is sequential with no seeking.
On Linux, this means you need to use -D_FILE_OFFSET_BITS=64 on the gcc command line.
Related
On Linux, is it possible to get the buffer size required for getdents64 to get all the entries in one go (assuming no modifications to the directory after the size is obtained)?
I tried the value from fstat(dirfd, &stb); stb.st_size, but it appears to be unnecessarily large. What value does stat::st_size hold for directories?
As far as I know, no, there is no way to do this, especially not in a manner that works for arbitrary filesystems. The representation of directories is defined by the underlying filesystem, and the same seems to be true of st_size. Certainly there is no way to get the right value for FUSE or 9p or other remote/virtual filesystems.
Why do you want to do this? I don't think it's useful. Once you get beyond a few kB per call, the overhead of the syscalls will be dominated by the actual work done. If you really care you can wrap the getdents64 syscall with a function that keeps resizing the buffer and calling it until EOF is reached. Or you could just use the portable opendir/readdir interfaces which perform appropriate buffering for you.
I am considering writing software that uses the file size as a pretest for whether two files are equivalent. There is no need to apply a sophisticated file content comparison if a simple file size integer comparison fails. The software is going to be written in Go (at first), but I think this question really boils down to the stat syscall and is therefore language-independent.
I need a platform-independent solution: it has to work across all systems and file systems. I can be sure that the file content will be the same sequence of bytes across all filesystems, but what about the file size?
If I transfer a file from one filesystem to another, can I be sure to get the same filesize on the other filesystem?
[Of course, I don't care about file metadata. This is obviously inconsistent. I only care about content sizes]
Yes, st_size should be the same across all filesystems (at least if they are posix compliant). A byte is a byte after all, no matter where you store it. The disk space consumed can be different though, depending on the underlying block size of the filesystem.
I'm using 64-bit MinGW to compile C code on Windows x64.
I'm using fwrite to create binary files from a memory array. I want to write ~20 GB with this function, but it only writes about 1.4-1.5 GB and then stops (without crashing, it just hangs there doing nothing).
Is there any solution? Right now I'm writing 20 files and then merging them.
Opening the file in "ab" mode works, but I can't read the file properly if I use that mode.
Sample (pseudo)code:
short *dst = malloc(20GB);             /* placeholder size */
/* calculations to fill dst */
FILE *file = fopen("myfile", "wb");
fwrite(dst, sizeof(short), 20GB / sizeof(short), file);
fclose(file);
That program never ends, and the file size never gets greater than 1.5 GB.
Write it in smaller chunks. For heaven's sake, don't try to malloc 20 GB.
Depending on the environment (operating system, memory model, file system), it might not be possible to create a file greater than 2 GB. This is especially true with MSDOS file systems and of course could be true on any file system if there is insufficient disk space or allocation quota.
If you show your code, we could see if there is any intrinsic flaw in the algorithm and suggest alternatives.
MinGW proper is a 32-bit environment; the 64-bit variant is the separate mingw-w64 project.
It may be that fwrite() from MinGW is unable to deal with more than 2 GB or 4 GB unless it is built large-file aware.
If you can find something similar to truss(1), run your program under that debugging tool. With the information you provided, it is not possible to give better advice.
I have a large text file to open (e.g., 5 GB in size), but limited RAM (say 1 GB). How can I open and read the file without any memory error? I am running on a Linux terminal with the basic packages installed.
This was an interview question, hence please do not look into the practicality.
I do not know whether to look at it in System level or programmatic level... It would be great if someone can throw some light into this issue.
Thanks.
Read it character by character... or X bytes by X bytes... it really depends what you want to do with it... As long as you don't need the whole file at once, that works.
(Ellipses are awesome)
What do they want you to do with the file? Are you looking for something? Extracting something? Sorting? This will affect your approach.
It may be sufficient to read the file line by line or character by character if you're looking for something. If you need to jump around the file or analyze sections of it, then you most likely want to memory-map it. Look up mmap(). Here's a short article on the subject: memory mapped i/o
If you are going to use system calls (open() and read()), then reading character by character will generate a huge number of system calls that severely slow down your application. Even with the buffer cache helping, system calls are expensive.
It is better to read block by block, where the block size should be at least 1 MB. With a 1 MB block size, reading a 5 GB file issues 5*1024 system calls.
I've read posts that show how to use fseek and ftell to determine the size of a file.
#include <stdio.h>
#include <stdlib.h>

FILE *fp;
long file_size;
char *buffer;

fp = fopen("foo.bin", "r");
if (NULL == fp) {
    /* Handle Error */
}
if (fseek(fp, 0, SEEK_END) != 0) {
    /* Handle Error */
}
file_size = ftell(fp);
if (-1L == file_size) {
    /* Handle Error */
}
buffer = malloc(file_size);
if (NULL == buffer) {
    /* handle error */
}
I was about to use this technique but then I ran into this link that describes a potential vulnerability.
The link recommends using fstat instead. Can anyone comment on this?
The link is one of the many nonsensical pieces of C coding advice from CERT. Their justification is based on liberties the C standard allows an implementation to take, but which are not allowed by POSIX and thus irrelevant in all cases where you have fstat as an alternative.
POSIX requires:
that the "b" modifier for fopen have no effect, i.e. that text and binary mode behave identically. This means their concern about invoking UB on text files is nonsense.
that files have a byte-resolution size set by write operations and truncate operations. This means their concern about random numbers of null bytes at the end of the file is nonsense.
Sadly with all the nonsense like this they publish, it's hard to know which CERT publications to take seriously. Which is a shame, because lots of them are serious.
If your goal is to find the size of a file, definitely you should use fstat() or its friends. It's a much more direct and expressive method--you are literally asking the system to tell you the file's statistics, rather than the more roundabout fseek/ftell method.
A bonus tip: if you only want to know if the file is available, use access() rather than opening the file or even stat'ing it. This is an even simpler operation which many programmers aren't aware of.
The reason to not use fstat is that fstat is POSIX, but fopen, ftell and fseek are part of the C Standard.
There may be a system that implements the C Standard but not POSIX. On such a system fstat would not work at all.
I'd tend to agree with their basic conclusion that you generally shouldn't use the fseek/ftell code directly in the mainstream of your code -- but you probably shouldn't use fstat either. If you want the size of a file, most of your code should use something with a clear, direct name like filesize.
Now, it probably is better to implement that using fstat where available, and (for example) FindFirstFile on Windows (the most obvious platform where fstat usually won't be available).
The other side of the story is that many (most?) of the limitations on fseek with respect to binary files actually originated with CP/M, which didn't explicitly store the size of a file anywhere. The end of a text file was signaled by a control-Z. For a binary file, however, all you really knew was what sectors were used to store the file. In the last sector, you had some amount of unused data that was often (but not always) zero-filled. Unfortunately, there might be zeros that were significant, and/or non-zero values that weren't significant.
If the entire C standard had been written just before being approved (e.g., if it had been started in 1988 and finished in 1989) they'd probably have ignored CP/M completely. For better or worse, however, they started work on the C standard in something like 1982 or so, when CP/M was still in wide enough use that it couldn't be ignored. By the time CP/M was gone, many of the decisions had already been made and I doubt anybody wanted to revisit them.
For most people today, however, there's just no point -- most code won't port to CP/M without massive work; this is one of the relatively minor problems to deal with. Making a modern program run in only 48K (or so) of memory for both the code and data is a much more serious problem (having a maximum of a megabyte or so for mass storage would be another serious problem).
CERT does have one good point though: you probably should not (as is often done) find the size of a file, allocate that much space, and then assume the contents of the file will fit there. Even though the fseek/ftell will give you the correct size with modern systems, that data could be stale by the time you actually read the data, so you could overrun your buffer anyway.
According to the C standard, §7.21.3:

Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state.
A letter-of-the-law kind of guy might think this UB can be avoided by calculating file size with:
fseek(file, -1, SEEK_END);
size = ftell(file) + 1;
But the C standard also says this:
A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.
As a result, there is nothing we can do to fix this with regard to fseek / SEEK_END. Still, I would prefer fseek / ftell instead of OS-specific API calls.