Is filesize/stat consistent across all filesystems?

I am considering writing software which uses the file size as a pretest to check whether two files are equivalent: there is no need to apply a sophisticated file content comparison if a simple file size integer comparison already fails. The software is going to be written in Go (first), but I think this question really boils down to the stat syscall and is therefore language independent.
I need a platform-independent solution. It has to work across all systems and file systems. I can be sure that the file content will be the same sequence of bytes on every filesystem, but what about the file size?
If I transfer a file from one filesystem to another, can I be sure to get the same filesize on the other filesystem?
[Of course, I don't care about file metadata, which is obviously inconsistent. I only care about the content size.]

Yes, st_size should be the same across all filesystems (at least if they are POSIX compliant). A byte is a byte, after all, no matter where you store it. The disk space consumed can differ, though, depending on the underlying block size of the filesystem.
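For illustration, here is a minimal C sketch of that pretest (the function name size_pretest is made up for the example); it only trusts st_size for regular files:

    /* Minimal sketch: use st_size as a cheap pretest before comparing file
     * contents byte by byte. st_size is only meaningful for regular files,
     * hence the S_ISREG check. */
    #include <sys/types.h>
    #include <sys/stat.h>

    /* Returns 1 if the files might be equal (sizes match), 0 if they cannot
     * be equal, -1 on error or if either path is not a regular file. */
    int size_pretest(const char *path_a, const char *path_b)
    {
        struct stat sa, sb;

        if (stat(path_a, &sa) != 0 || stat(path_b, &sb) != 0)
            return -1;
        if (!S_ISREG(sa.st_mode) || !S_ISREG(sb.st_mode))
            return -1;

        return sa.st_size == sb.st_size;
    }

If the pretest returns 1, the contents still have to be compared (or hashed) to confirm equality; equal sizes alone prove nothing.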


C: Reading large files with limited memory

I am working on something that requires reading from and writing to a large file (or equivalent) but is allowed fairly minimal memory to do it (I don't have the exact spec, but let's call the "large" 15GB and the "minimal" 16K). The file is accessed randomly, usually in chunks of 512 bytes, and it is guaranteed that consecutive reads will sometimes be a significant distance apart - possibly literally opposite ends of the disk (or a small number of MB from either end). Currently I'm using pread/pwrite to hit the locations I want in the file (I was previously using fseek, but abandoned it in favor of pread/pwrite for various reasons).
Accessing the file this way is (perhaps obviously) slow, and I'm looking for ways to optimise/speed up the performance as much as possible, with as little use of external libraries as possible (read: none).
I don't mean to be too cagey about exactly what we're doing, so it might help to think of it as a driver for a file system. At one end of the disk we're accessing the file and directory tables, and at the other the raw data - so we need to write file information and then skip to the data. But even within those zones, don't assume anything about the layout. There is no guarantee that multiple files (or even multiple chunks of a single file) will be stored contiguously - or even close together. This also means that we can't make assumptions about the order in which data will be read.
A couple of things I have considered include:
Opening Multiple File Descriptors for different parts of the file (but I'm not sure there's any state associated with the FD and whether this would even have an impact)
A few smarts around caching data that I expect to be accessed several times in a short amount of time
I was wondering whether others might have been in a similar boat and/or have opinions (or articles they can link) that discuss different strategies to minimise the impact of reading.
I guess I was always wondering whether pread is even the right choice in this situation....
Any thoughts/opinions/criticisms/etc more than welcome.
NOTE: The program will always run in a single thread (so options don't need to be thread-safe, but equally pushing the read to the background isn't an option either).
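For what it's worth, here is one rough C sketch of the caching idea from the list above; the cache geometry and names are invented for the example, and 16 slots of 512 bytes stay well under the 16K budget:

    /* Illustrative sketch: a tiny direct-mapped cache of 512-byte blocks in
     * front of pread(). Not tuned for any particular workload. */
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK_SIZE  512
    #define CACHE_SLOTS 16              /* 16 * 512 = 8 KiB of cached data */

    struct block_cache {
        int     fd;
        int64_t tags[CACHE_SLOTS];      /* block number in each slot, -1 if empty */
        char    data[CACHE_SLOTS][BLOCK_SIZE];
    };

    void cache_init(struct block_cache *c, int fd)
    {
        c->fd = fd;
        for (int i = 0; i < CACHE_SLOTS; i++)
            c->tags[i] = -1;
    }

    /* Read one block-aligned 512-byte chunk, hitting the disk only on a miss. */
    ssize_t cache_read_block(struct block_cache *c, int64_t block_no, void *out)
    {
        int slot = (int)(block_no % CACHE_SLOTS);

        if (c->tags[slot] != block_no) {            /* miss: fetch from disk */
            ssize_t n = pread(c->fd, c->data[slot], BLOCK_SIZE,
                              (off_t)block_no * BLOCK_SIZE);
            if (n != BLOCK_SIZE)
                return n < 0 ? -1 : n;              /* error or short read */
            c->tags[slot] = block_no;
        }
        memcpy(out, c->data[slot], BLOCK_SIZE);
        return BLOCK_SIZE;
    }

A cache like this only pays off if nearby reads actually repeat block numbers, and any write path would need to update or invalidate the matching slot, which the sketch leaves out.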

How can I use LibC/WINAPI to use a dvd as a binary blob?

I was thinking recently: whenever I use a disc, I either burn an image onto it, or format it and use it like a USB stick. I have never used it as a raw storage medium to poke bytes into / read bytes from.
I am now curious if it is possible to use a DVD as a blob of binary data that I can write bits onto as I please.
From what I understand, it is trivial to write to a DVD using C if I format it, so that I can interface with it much like a typical C or D drive (I can even rename the disc to C or D if I want to).
I'm curious if I can do the same without formatting the disc, so that the only bits on it are the ones that I write, or the default ones.
To summarize, I want to be able to perform the following operations on an unformatted DVD-RW:
read a bunch of bytes at an offset into an in-memory byte pool
overwrite a bunch of bytes at an offset from an in-memory byte pool without affecting other bytes on the disc
How can this be accomplished?
Thanks ahead of time.
On Linux, you can just open the block device and do sufficiently aligned writes:
Documentation/cdrom/packet-writing.txt in the kernel sources
You only need to format the media as DVD+RW once, using dvd+rw-format. This is a relatively simple procedure, so you could extract it from the source code of that tool.
However, according to the kernel documentation, what counts as a "sufficiently aligned write" is somewhat up to interpretation: the spec says 2 KiB, but some drives require coarser alignment. There is also no wear leveling or sector remapping at this layer, so good results really require on-disk data structures which reflect that, in reality, this technology is closer to write-once than to truly random access.
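As a very rough sketch of what that looks like in C (the device path /dev/sr0 is an assumption, and it presumes the media was formatted once with dvd+rw-format as described above):

    /* Sketch only: open the DVD's block device and do 2 KiB-aligned
     * reads/writes. Some drives need coarser alignment than this. */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define DVD_SECTOR 2048             /* nominal write granularity */

    int main(void)
    {
        int fd = open("/dev/sr0", O_RDWR);
        if (fd < 0)
            return 1;

        /* One sector's worth of buffer; offsets and lengths must be
         * multiples of DVD_SECTOR. */
        char *sector = aligned_alloc(DVD_SECTOR, DVD_SECTOR);
        if (!sector) {
            close(fd);
            return 1;
        }

        /* Read sector 16, modify a byte, write it back at the same offset. */
        if (pread(fd, sector, DVD_SECTOR, 16 * DVD_SECTOR) == DVD_SECTOR) {
            sector[0] = 0x42;
            pwrite(fd, sector, DVD_SECTOR, 16 * DVD_SECTOR);
        }

        free(sector);
        close(fd);
        return 0;
    }

The example deliberately does a read-modify-write of a whole sector, matching the alignment requirement mentioned above.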

Under what circumstances will fseek/ftell or fstat fail to get the size of a file?

I'm trying to access a file as a char array, by memory mapping it, or by copying it into a buffer, or whatever, but both of these need the size of the file. Easy enough, I thought: just use fseek(file, 0, SEEK_END).
However, according to C++ Reference, "Library implementations [of fseek] are allowed to not meaningfully support SEEK_END," meaning that I can't portably get the size of a file using that method.
Next I tried fstat, which is less portable, but will at least produce a compile error rather than a runtime problem; however, The Open Group notes that fstat does not need to provide a meaningful value for st_size.
So: has anyone actually come across a system where these methods do not work?
The notes about files not reporting valid sizes are there because, on Linux, there are many "files" for which "file size" is not a meaningful concept.
There are two main cases:
The file is not a regular file. In particular, pipes, sockets, and character device files are streams of data where data is consumed on read, and not put on disk, so a size does not make much sense.
The file system that the file resides on does not provide the file size. This is especially common in "virtual" filesystems, where the file contents are generated when read and, again, have no disk backing.
To expand: filesystems do not necessarily keep file contents on disk. Since the filesystem API is a convenient API for expressing hierarchical data, and there are many tools for operating on files, it sometimes makes sense to expose data as a file hierarchy. For example, /proc/ contains information about processes (such as open files and used memory) and /sys/ contains driver-specific information and options (anything from sensor sampling rates to LED colors). With FUSE (Filesystem in Userspace), you can program a filesystem to do pretty much anything, from SSHing into a remote computer to exposing Twitter as a filesystem.
For a lot of these filesystems, "file size" may not make much sense. For example, an LED driver might expose three files: red, green, and blue. They can be read to get the current color, or written to change it. Now, is it really worth implementing a file size for them, given that they are merely settings in RAM, don't have any disk backing, and can't be removed? Not really.
In summary, files are not necessarily "things on disk". For many of the more advanced usages of files, "file size" either does not make sense or is not worth providing.
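A practical consequence, sketched in C: check st_mode before trusting st_size, and be ready to fall back to streaming the contents (the function name is illustrative):

    /* Sketch: report the size only when fstat says "regular file";
     * otherwise signal the caller to fall back to reading in chunks. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    long long size_or_fallback(FILE *f)
    {
        struct stat st;

        if (fstat(fileno(f), &st) != 0)
            return -1;
        if (!S_ISREG(st.st_mode))
            return -1;      /* pipe, socket, device, ... : no meaningful size */

        /* Caveat: some virtual filesystems (e.g. /proc) report regular files
         * with st_size == 0 even though reading returns data, so treat a
         * size of 0 with suspicion. */
        return (long long)st.st_size;
    }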

Is saving a binary file a standard? Is it limited to only 1 type?

When should a programmer use .bin files? (practical examples).
Is it popular (or accepted) to save different data types in one file?
When iterating over the data in a file (that has several data types), the program must know the exact length of every data type, and I find that limiting.
If you mean general-purpose application data, text files are often preferred because they provide transparency to the user, and might also make it easier to (for instance) move the data to a different application and avoid lock-in.
Binary files are mostly used for performance and compactness; encoding things as text has non-trivial overhead in both of these departments (today, perhaps mostly in size), which is sometimes prohibitive.
Binary files are used whenever compactness or speed of reading/writing are required.
Those two requirements are closely related in the obvious way that reading and writing small files is fast, but there's one other important reason that binary I/O can be fast: when the records have fixed length, that makes random access to records in the file much easier and faster.
As an example, suppose you want to do a binary search within the records of a file (they'd have to be sorted, of course), without loading the entire file to memory (maybe because the file is so large that it doesn't fit in RAM). That can be done efficiently only when you know how to compute the offset of the "midpoint" between two records, without having to parse arbitrarily large parts of a file just to find out where a record starts or ends.
(As noted in the comments, random access can be achieved with text files as well; it's just usually harder to implement and slower.)
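As a concrete sketch of that in C, assuming sorted fixed-length records whose first field is a 32-bit key (the record layout is invented here, and it takes for granted that fseek/ftell work sensibly for binary files on the platform, which the previous question discusses):

    /* Sketch: binary search directly in the file, touching only
     * O(log n) records instead of loading everything into memory. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define RECORD_SIZE 64   /* fixed length makes the offset math trivial */

    /* Returns the index of the record whose leading uint32_t key matches,
     * or -1 if not found (or on I/O error). */
    long find_record(FILE *f, uint32_t key)
    {
        unsigned char rec[RECORD_SIZE];

        fseek(f, 0, SEEK_END);
        long lo = 0, hi = ftell(f) / RECORD_SIZE - 1;

        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            fseek(f, mid * RECORD_SIZE, SEEK_SET);   /* offset = index * size */
            if (fread(rec, RECORD_SIZE, 1, f) != 1)
                return -1;

            uint32_t k;
            memcpy(&k, rec, sizeof k);               /* first field is the key */
            if (k == key)
                return mid;
            else if (k < key)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return -1;
    }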
I think when embedded developers see a ".bin" file, it's generally a flattened version of an ELF or the like, intended for programming as firmware for a processor. For instance, putting the Linux kernel into flash (depending on your bootloader).
As a general practice of whether or not to use binary files, you see it done for many reasons. Text requires parsing, and that can be a great deal of overhead. If it's intended to be usable by the user though, binary is a poor format, and text really shines.
Where binary is best is for performance. You can do things like map it into memory, and take advantage of the structure to speed up access. Sometimes you'll have two binary files, one with data and one with metadata, that can be used to help with searching through gobs of data. For example, Git does this: it defines an index format, a pack format, and an object format that all work together to save the history of your project in a readily accessible but compact way.

Does fread fail for large files?

I have to analyze a 16 GB file. I am reading through the file sequentially using fread() and fseek(). Is it feasible? Will fread() work for such a large file?
You don't mention a language, so I'm going to assume C.
I don't see any problems with fread, but fseek and ftell may have issues.
Those functions use long int as the data type to hold the file position, rather than something intelligent like fpos_t or even size_t. This means that they can fail to work on a file over 2 GB, and can certainly fail on a 16 GB file.
You need to see how big long int is on your platform. If it's 64 bits, you're fine. If it's 32, you are likely to have problems when using ftell to measure distance from the start of the file.
Consider using fgetpos and fsetpos instead.
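A small sketch of the fgetpos/fsetpos approach; note that fpos_t is opaque, so it can only save and restore positions you have already reached, not compute arbitrary offsets:

    /* Sketch: bookmark a position with fgetpos and return to it with
     * fsetpos, without relying on long int being wide enough. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("huge.dat", "rb");   /* file name is just an example */
        if (!f)
            return 1;

        fpos_t mark;
        fgetpos(f, &mark);                   /* remember the current position */

        char buf[512];
        fread(buf, 1, sizeof buf, f);        /* read a chunk ... */

        fsetpos(f, &mark);                   /* ... then jump back to the mark */

        fclose(f);
        return 0;
    }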
Thanks for the response. I figured out where I was going wrong. fseek() and ftell() do not work for files larger than 4GB. I used _fseeki64() and _ftelli64() and it is working fine now.
If implemented correctly this shouldn't be a problem. I assume by sequentially you mean you're looking at the file in discrete chunks and advancing your file pointer.
Check out http://www.computing.net/answers/programming/using-fread-with-a-large-file-/10254.html
It sounds like he was doing nearly the same thing as you.
It depends on what you want to do. If you want to read the whole 16GB of data in memory, then chances are that you'll run out of memory or application heap space.
Rather read the data chunk by chunk and do processing on those chunks (and free resources when done).
But, besides all this, decide which approach you want to take (using fread() or istream, etc.) and write some test cases to see which works better for you.
If you're on a POSIX-ish system, you'll need to make sure you've built your program with 64-bit file offset support. POSIX requires (or at least allows, and most systems enforce this) the implementation to deny I/O operations on files whose size doesn't fit in off_t, even if the only I/O being performed is sequential with no seeking.
On Linux, this means you need to use -D_FILE_OFFSET_BITS=64 on the gcc command line.
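A minimal sketch of that setup, assuming glibc and a placeholder file name:

    /* Build with 64-bit offsets, e.g.:
     *   gcc -D_FILE_OFFSET_BITS=64 -o bigseek bigseek.c
     * so that off_t, fseeko and ftello use 64-bit offsets even on 32-bit
     * systems. */
    #include <stdio.h>
    #include <sys/types.h>

    int main(void)
    {
        FILE *f = fopen("big.dat", "rb");    /* the 16 GB file; name is an example */
        if (!f)
            return 1;

        /* Seek past the 4 GB mark; this offset would not fit in a 32-bit off_t. */
        off_t target = (off_t)5 * 1024 * 1024 * 1024;
        if (fseeko(f, target, SEEK_SET) != 0) {
            fclose(f);
            return 1;
        }

        printf("now at %lld\n", (long long)ftello(f));
        fclose(f);
        return 0;
    }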
