I know fgetpos/fsetpos are used for returning to a file position.
But if I reached that position with fseek to begin with, is it more efficient to use fgetpos/fsetpos to return later, or just to repeat the same fseek?
Is fgetpos/fsetpos any faster than fseek?
For general file positioning, fseek()/ftell() are limited to file sizes of about LONG_MAX bytes. fsetpos()/fgetpos() are designed to handle the full range of file sizes the file system supports.
For large files, fseek()/ftell() are not an option. @Thomas Padron-McCarthy
When coding for C99 onward, robust code uses fsetpos()/fgetpos() in lieu of a minor optimization that may or may not be present with the more limited fseek()/ftell().
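As for raw speed, any difference is likely negligible: both end up repositioning the same underlying stream. A minimal sketch of the save-and-return pattern with fgetpos()/fsetpos(), using a hypothetical file name:

#include <stdio.h>

int main(void) {
    FILE *fp = fopen("data.bin", "rb");   /* hypothetical file */
    if (fp == NULL)
        return 1;

    fpos_t saved;
    if (fgetpos(fp, &saved) != 0) {       /* remember where we are */
        fclose(fp);
        return 1;
    }

    /* ... read ahead from here ... */

    if (fsetpos(fp, &saved) != 0) {       /* and jump straight back */
        fclose(fp);
        return 1;
    }

    fclose(fp);
    return 0;
}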
There are many similar questions, but nothing that answers this specifically after googling around quite a bit. Here goes:
Say we have a file (could be binary, and much bigger too):
abcdefghijklmnopqrstuvwxyz
what is the best way in C to "move" a right most portion of this file to the left, truncating the beginning of the file.. so, for example, "front truncating" 7 bytes would change the file on disk to be:
hijklmnopqrstuvwxyz
I must avoid temporary files, and would prefer not to read the whole file into a large buffer in memory. One possible method I thought of is to fopen the file in "rb+" mode and constantly fseek back and forth, reading from the offset and writing at the beginning to copy the bytes down, then use something like SetEndOfFile to truncate at the end. That seems to involve a lot of seeking (possibly inefficient).
Another way would be to fopen the same file twice, and use fgetc and fputc with the respective file pointers. Is this even possible?
If there are other ways, I'd love to read all of them.
You could mmap the file into memory and then memmove the contents. You would have to truncate the file separately.
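A minimal sketch of that approach, assuming a POSIX system; the helper name and parameters are made up for illustration:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* illustrative helper: drop the first 'cut' bytes of the file at 'path' */
int front_truncate(const char *path, off_t cut) {
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= cut) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memmove(p, p + cut, (size_t)(st.st_size - cut));  /* slide the tail down */
    munmap(p, st.st_size);

    int rc = ftruncate(fd, st.st_size - cut);         /* chop off the leftovers */
    close(fd);
    return rc;
}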
You don't have to use an enormous buffer, and the kernel is going to be doing the hard work for you; but yes, reading a bufferful from further along the file and writing it nearer the beginning is the way to do it if you can't afford the simpler job of creating a new file, copying what you want into that file, and then moving the new (temporary) file over the old one.

I wouldn't rule out the possibility that the copy-to-a-new-file approach will be faster than the shuffling process you describe. If the number of bytes to be removed were a disk block size, rather than 7 bytes, the situation might be different, but probably not. The only disadvantage of the copying approach is that it requires more intermediate disk space.
The approach you outline will require truncate() or ftruncate() to shorten the file to the proper length, assuming you are on a POSIX system. If you don't have truncate(), then you will need to do the copying.
Note that opening the file twice will work OK if you are careful not to clobber the file when opening for writing - using "r+b" mode with fopen(), or avoiding O_TRUNC with open().
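For reference, a rough sketch of the shuffle-and-truncate version. It uses POSIX pread()/pwrite() so there is no explicit seeking back and forth; the function name is just illustrative:

#include <unistd.h>

/* illustrative: copy everything after 'cut' down to offset 0, then shorten */
int shift_left(int fd, off_t cut, off_t size) {
    char buf[64 * 1024];
    off_t rd = cut, wr = 0;

    while (rd < size) {
        ssize_t n = pread(fd, buf, sizeof buf, rd);   /* read from up the file */
        if (n <= 0)
            return -1;
        if (pwrite(fd, buf, n, wr) != n)              /* write near the start */
            return -1;
        rd += n;
        wr += n;
    }
    return ftruncate(fd, size - cut);                 /* drop the stale tail */
}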
If you are using Linux, then since kernel 3.15 you can use
#include <fcntl.h>
int fallocate(int fd, int mode, off_t offset, off_t len);
with the FALLOC_FL_COLLAPSE_RANGE flag.
http://manpages.ubuntu.com/manpages/disco/en/man2/fallocate.2.html
Note that not all file systems support it but most modern ones such as ext4 and xfs do.
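A hedged sketch of the call; be aware that FALLOC_FL_COLLAPSE_RANGE requires both offset and len to be multiples of the filesystem block size, so it cannot remove exactly 7 bytes:

#define _GNU_SOURCE   /* fallocate() and FALLOC_FL_COLLAPSE_RANGE are Linux-specific */
#include <fcntl.h>

/* remove 'len' bytes starting at 'offset' without rewriting the rest;
   both values must be multiples of the filesystem block size */
int collapse(int fd, off_t offset, off_t len) {
    return fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, offset, len);
}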
I've read posts that show how to use fseek and ftell to determine the size of a file.
FILE *fp;
long file_size;
char *buffer;

fp = fopen("foo.bin", "rb");   /* binary mode, since the data is binary */
if (NULL == fp) {
    /* Handle error */
}
if (fseek(fp, 0, SEEK_END) != 0) {
    /* Handle error */
}
file_size = ftell(fp);
if (file_size < 0) {
    /* Handle error: ftell() returns -1L on failure */
}
rewind(fp);                    /* back to the start before reading */
buffer = malloc(file_size);
if (NULL == buffer) {
    /* Handle error */
}
I was about to use this technique but then I ran into this link that describes a potential vulnerability.
The link recommends using fstat instead. Can anyone comment on this?
The link is one of the many nonsensical pieces of C coding advice from CERT. Their justification is based on liberties the C standard allows an implementation to take, but which are not allowed by POSIX and thus irrelevant in all cases where you have fstat as an alternative.
POSIX requires:
that the "b" modifier for fopen have no effect, i.e. that text and binary mode behave identically. This means their concern about invoking UB on text files is nonsense.
that files have a byte-resolution size set by write operations and truncate operations. This means their concern about random numbers of null bytes at the end of the file is nonsense.
Sadly, with all the nonsense like this that they publish, it's hard to know which CERT publications to take seriously. Which is a shame, because lots of them are serious.
If your goal is to find the size of a file, definitely you should use fstat() or its friends. It's a much more direct and expressive method--you are literally asking the system to tell you the file's statistics, rather than the more roundabout fseek/ftell method.
A bonus tip: if you only want to know if the file is available, use access() rather than opening the file or even stat'ing it. This is an even simpler operation which many programmers aren't aware of.
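For reference, a minimal POSIX sketch of both calls, with a hypothetical file name:

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

struct stat st;
if (stat("foo.bin", &st) == 0) {
    /* byte-exact size, no seeking required */
    printf("size: %lld\n", (long long)st.st_size);
}

if (access("foo.bin", F_OK) == 0) {
    /* file exists right now (it may still vanish before you open it) */
}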
The reason not to use fstat is that fstat is POSIX, while fopen, ftell, and fseek are part of the C standard.
There may be a system that implements the C Standard but not POSIX. On such a system fstat would not work at all.
I'd tend to agree with their basic conclusion that you generally shouldn't use the fseek/ftell code directly in the mainstream of your code -- but you probably shouldn't use fstat either. If you want the size of a file, most of your code should use something with a clear, direct name like filesize.
Now, it probably is better to implement that using fstat where available, and (for example) FindFirstFile on Windows (the most obvious platform where fstat usually won't be available).
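Something along these lines, as a sketch; the Windows branch uses _stati64, which is specific to the Microsoft CRT:

#include <stdint.h>
#include <sys/stat.h>

/* illustrative wrapper: returns the size in bytes, or -1 on failure */
#ifdef _WIN32
static int64_t filesize(const char *path) {
    struct _stati64 st;
    return _stati64(path, &st) == 0 ? st.st_size : -1;
}
#else
static int64_t filesize(const char *path) {
    struct stat st;
    return stat(path, &st) == 0 ? (int64_t)st.st_size : -1;
}
#endif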
The other side of the story is that many (most?) of the limitations on fseek with respect to binary files actually originated with CP/M, which didn't explicitly store the size of a file anywhere. The end of a text file was signaled by a control-Z. For a binary file, however, all you really knew was what sectors were used to store the file. In the last sector, you had some amount of unused data that was often (but not always) zero-filled. Unfortunately, there might be zeros that were significant, and/or non-zero values that weren't significant.
If the entire C standard had been written just before being approved (e.g., if it had been started in 1988 and finished in 1989) they'd probably have ignored CP/M completely. For better or worse, however, they started work on the C standard in something like 1982 or so, when CP/M was still in wide enough use that it couldn't be ignored. By the time CP/M was gone, many of the decisions had already been made and I doubt anybody wanted to revisit them.
For most people today, however, there's just no point -- most code won't port to CP/M without massive work; this is one of the relatively minor problems to deal with. Making a modern program run in only 48K (or so) of memory for both the code and data is a much more serious problem (having a maximum of a megabyte or so for mass storage would be another serious problem).
CERT does have one good point though: you probably should not (as is often done) find the size of a file, allocate that much space, and then assume the contents of the file will fit there. Even though the fseek/ftell will give you the correct size with modern systems, that data could be stale by the time you actually read the data, so you could overrun your buffer anyway.
According to the C standard, §7.21.3:

Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state.
A letter-of-the-law kind of guy might think this UB can be avoided by calculating file size with:
fseek(file, -1, SEEK_END);
size = ftell(file) + 1;
But the C standard also says this:
A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.
As a result, there is nothing we can do to fix this with regard to fseek / SEEK_END. Still, I would prefer fseek / ftell instead of OS-specific API calls.
I have to analyze a 16 GB file. I am reading through the file sequentially using fread() and fseek(). Is it feasible? Will fread() work for such a large file?
You don't mention a language, so I'm going to assume C.
I don't see any problems with fread, but fseek and ftell may have issues.
Those functions use long int as the data type to hold the file position, rather than something intelligent like fpos_t or even size_t. This means that they can fail to work on a file over 2 GB, and can certainly fail on a 16 GB file.
You need to see how big long int is on your platform. If it's 64 bits, you're fine. If it's 32, you are likely to have problems when using ftell to measure distance from the start of the file.
Consider using fgetpos and fsetpos instead.
Thanks for the response. I figured out where I was going wrong. fseek() and ftell() do not work for files larger than 4GB. I used _fseeki64() and _ftelli64() and it is working fine now.
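(For anyone else landing here: those are MSVC-specific. A minimal sketch, with a hypothetical file name:)

#include <stdio.h>

/* MSVC-only 64-bit equivalents of fseek()/ftell() */
FILE *fp = fopen("big.bin", "rb");
if (fp != NULL) {
    _fseeki64(fp, 0, SEEK_END);
    long long size = _ftelli64(fp);   /* returns __int64 under MSVC */
    printf("%lld bytes\n", size);
    fclose(fp);
}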
If implemented correctly this shouldn't be a problem. I assume by sequentially you mean you're looking at the file in discrete chunks and advancing your file pointer.
Check out http://www.computing.net/answers/programming/using-fread-with-a-large-file-/10254.html
It sounds like he was doing nearly the same thing as you.
It depends on what you want to do. If you want to read the whole 16GB of data in memory, then chances are that you'll run out of memory or application heap space.
Rather read the data chunk by chunk and do processing on those chunks (and free resources when done).
But, besides all this, decide which approach you want to take (using fread() or istream, etc.) and run some test cases to see which works better for you.
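A minimal chunked-read loop along those lines; process() is a hypothetical handler:

#include <stdio.h>

static char buf[1 << 20];                 /* 1 MiB chunk; tune to taste */
size_t n;
while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
    process(buf, n);                      /* hypothetical per-chunk work */
if (ferror(fp)) {
    /* handle read error */
}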
If you're on a POSIX-ish system, you'll need to make sure your program is built with 64-bit file offset support. POSIX mandates (or at least allows; most systems enforce this) that the implementation deny I/O operations on files whose size doesn't fit in off_t, even if the only I/O being performed is sequential, with no seeking.
On Linux, this means you need to use -D_FILE_OFFSET_BITS=64 on the gcc command line.
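A quick sanity check that the flag took effect:

#include <stdio.h>
#include <sys/types.h>

int main(void) {
    /* prints 8 when 64-bit file offsets are enabled */
    printf("sizeof(off_t) = %zu\n", sizeof(off_t));
    return 0;
}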
I am looking for various ways of reading/writing data from stdin/stdout. Currently I know about scanf/printf, getchar/putchar, and gets/puts. Are there any other ways of doing this? Also, I am interested in knowing which one is the most efficient in terms of memory and space.
Thanks in Advance
fgets()
fputs()
read()
write()
And others, details can be found here: http://www.cplusplus.com/reference/clibrary/cstdio/
As for your efficiency question, take a look at this: http://en.wikipedia.org/wiki/I/O_bound
Stdio is designed to be fairly efficient no matter which way you prefer to read data. If you need to do character-by-character reads and writes, the functions usually expand to macros which just access the buffer, except when it's full/empty.

For line-by-line text I/O, use puts/fputs and fgets. (But NEVER use gets, because there's no way to control how many bytes it will read!)

The printf family (e.g. fprintf) is of course extremely useful for text because it allows you to skip constructing a temporary buffer in memory before writing (and thus lets you avoid thinking about all the memory allocation, overflow, etc. issues). fscanf tends to be much less useful, but mostly because it's difficult to use. If you study the documentation for fscanf well and learn how to use %[, %n, and the numeric specifiers, it can be very powerful!
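For example, a small sketch of %[ and %n together (the 63 guards the buffer):

#include <stdio.h>

char word[64];
int consumed = 0;
/* %63[a-z] reads up to 63 lowercase letters; %n stores how many
   characters have been consumed so far (it doesn't count as a match) */
if (scanf("%63[a-z]%n", word, &consumed) == 1)
    printf("read %d characters: %s\n", consumed, word);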
For large blocks of text (e.g. loading a whole file into memory) or binary data, you can also use the fread and fwrite functions. You should always pass 1 for the size argument and the number of bytes to read/write for the count argument; otherwise it's impossible to tell from the return value how much was successfully read or written.
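That is, a sketch of the pattern:

char buf[4096];
size_t n = fread(buf, 1, sizeof buf, fp);   /* n = bytes actually read */
if (n < sizeof buf && ferror(fp)) {
    /* short read due to an error; with size > 1 you couldn't
       recover the exact byte count from the return value */
}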
If you're on a reasonably POSIX-like system (pretty much anything) you can also use the lower-level io functions open, read, write, etc. These are NOT part of the C standard but part of POSIX, and non-POSIX systems usually provide the same functions but possibly with slightly-different behavior (for example, file descriptors may not be numbered sequentially 0,1,2,... like POSIX would require).
If you're looking for immediate-mode type stuff don't forget about Curses (more applicable on the *NIX side but also available on Windows)
I'm writing several C programs for an embedded system where every bit of performance we can squeeze out will matter. Part of that is accessing log files. When determining if a file exists, is there any performance difference between using open/fopen and stat? I've been using stat on the assumption that it only has to do a quick check against the file system, whereas fopen would have to actually gain access to a file and manipulate internal data structures before returning. Is there any merit to this?
stat is probably better, since it doesn't have to allocate resources for actually reading the file. You won't have to call fclose to release those resources, and you may also benefit from caching of recently checked files.
When in doubt, test it out. Time a big loop that checks for 1000 files using each method, with an appropriate mix of filenames that exist and don't exist.
If you have the source code for stat and fopen, you should be able to read through it and get an idea as to which will require more resources.
stat() does not need to create any user-side memory data structures. No matter how aggressive your caching policy, stat() will not try to pre-read the file's data. I think stat() is the safer bet.
How about access()?
If you want to squeeze out performance with respect to querying file existence and opening files, minimize the number of fopen and stat calls in general. The call to the file system should be way more expensive than anything the runtime does to translate it.
For only testing file existence, stat() would be preferred over fopen().
However, depending upon your setup, it could be worthwhile to use lstat() instead of stat().
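For completeness, a minimal sketch of the stat()-based existence check, with a made-up path; lstat() is identical except that it reports on a symlink itself rather than on its target:

#include <sys/stat.h>

struct stat st;
if (stat("/var/log/app.log", &st) == 0) {
    /* the file (or whatever a symlink points at) exists */
}
if (lstat("/var/log/app.log", &st) == 0) {
    /* the directory entry itself exists, even a dangling symlink */
}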