How does one write files to disk, sequentially, in C?

I want to write a program that writes data as one contiguous block of data to disk, so that when I read that data back from the disk, I can just read one long series of bytes without stopping. Are there any references I can be directed to regarding this issue?
I am essentially asking whether it is possible to write the data for multiple files contiguously and then read past one EOF (or several) to retrieve all of the data that was written.
I am aware of fwrite and fopen, I just want to be sure that the data being written to disk is contiguous.

It depends on what the underlying filesystem is, as this is filesystem-dependent. You'll want to look at extents, which are a contiguous area of storage reserved for a file.
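One related technique (not mentioned in the answer above, but a common way to work with extent-based filesystems on POSIX systems) is to preallocate the space in a single call with posix_fallocate, which gives the allocator the best chance of handing back one large contiguous extent. The file name and size below are placeholders; a minimal sketch:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big.dat", O_CREAT | O_WRONLY, 0644);   /* name is a placeholder */
    if (fd < 0) { perror("open"); return 1; }

    /* Ask the filesystem to reserve 1 GB up front; on extent-based
       filesystems (ext4, xfs, ...) this encourages one large contiguous
       allocation. Note it returns an error number, not -1/errno. */
    int err = posix_fallocate(fd, 0, 1024L * 1024L * 1024L);
    if (err != 0)
        fprintf(stderr, "posix_fallocate failed: %d\n", err);

    close(fd);
    return 0;
}
```

Whether the reserved space is actually contiguous still depends on the filesystem and how fragmented it already is.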

On Windows you can open an unformatted volume with CreateFile and then WriteFile a contiguous block of data. It won't be a file, but you will be able to read it back as you stated.
According to this, NTFS tries to allocate contiguous space if possible, though your chances are lower when appending.
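A minimal sketch of that Windows approach, assuming the volume path ("\\.\E:") and block size are placeholders; raw-volume access normally requires administrator rights and sector-aligned sizes:

```c
#include <windows.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Open the raw volume rather than a file on it. */
    HANDLE h = CreateFileA("\\\\.\\E:",
                           GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    static unsigned char block[512 * 1024];   /* multiple of the sector size */
    memset(block, 0xAB, sizeof block);

    DWORD written = 0;
    if (!WriteFile(h, block, sizeof block, &written, NULL))
        fprintf(stderr, "WriteFile failed: %lu\n", GetLastError());

    CloseHandle(h);
    return 0;
}
```

Reading the data back is the same pattern with ReadFile at the same offset.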

Related

Reading a file of arbitrary length in C

What's the most idiomatic/efficient way to read a file of arbitrary length in C?
Get the filesize of the file in bytes and issue a single fread()
Keep fread()ing a constant size buffer until getting EOF
Anything else?
Avoid using any technique which requires knowing the size of the file in advance. That leaves exactly one technique: read the file a bit at a time, in blocks of a convenient size.
Here's why you don't want to try to find the filesize in advance:
If it is not a regular file, there may not be any way to tell. For example, you might be reading directly from a console, or taking piped input from a previous data generator. If your program requires the filesize to be knowable, these useful input mechanisms will not be available to your users, who will complain or choose a different tool.
Even if you can figure out the filesize, you have no way of preventing it from changing while you are reading the file. If you are not careful about how you read the file, you might open a vulnerability which could be exploited by adversarial programs.
For example, if you allocate a buffer of the "correct" size and then read until you get an end-of-file condition, you may end up overwriting random memory. (Multiple reads may be necessary if you use an interface like read() which might read less data than requested.) Or you might find that the file has been truncated; if you don't check the amount of data read, you might end up processing uninitialised memory leading to information leakage.
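For example, a loop like the following sketch (the "processing" step is just a byte count here) handles regular files, pipes and consoles alike, without ever needing to know the size in advance:

```c
#include <stdio.h>

/* Process stdin a block at a time; works for files, pipes and terminals. */
int main(void)
{
    char buf[64 * 1024];              /* any convenient block size */
    size_t n;
    unsigned long long total = 0;

    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
        /* ... process the n bytes in buf here ... */
        total += n;
    }
    if (ferror(stdin)) {
        perror("fread");
        return 1;
    }
    printf("read %llu bytes\n", total);
    return 0;
}
```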
In practice, you usually don't need to keep the entire file content in memory. You'll often parse the file (notably if it is textual), or at least read the file in smaller pieces, and for that you don't need it entirely in memory. For a textual file, reading it line-by-line (perhaps with some state inside your parser) is often enough (using fgets or getline).
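A line-oriented sketch of that, using fgets (getline would avoid the fixed line limit, but it is POSIX rather than standard C; the 4 KB line buffer is an assumption):

```c
#include <stdio.h>

int main(void)
{
    char line[4096];                      /* assumes lines fit in 4 KB */
    unsigned long count = 0;

    while (fgets(line, sizeof line, stdin) != NULL) {
        /* ... parse one line here, keeping any parser state you need ... */
        count++;
    }
    printf("%lu lines\n", count);
    return 0;
}
```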
Files exist (notably on disks or SSDs) because they can usually be much "bigger" than your computer's RAM. In fact, files were invented (more than 50 years ago) precisely to deal with data larger than memory. Distributed file systems can also be very big (and can be accessed remotely, even from a laptop, e.g. by NFS, CIFS, etc.).
Some file systems are capable of storing petabytes of data (on supercomputers), with individual files of many terabytes (much larger than available RAM).
You'll also likely use some databases. These routinely hold terabytes of data. See also this answer (about the realistic size of sqlite databases).
If you really want to read a file entirely into memory using stdio (though you should avoid doing that, because you generally want your program to handle files larger than available memory, so reading an entire file into memory is usually a design error), you can indeed loop on fread (or fscanf, or even fgetc) until end-of-file. Notice that feof is useful only after some input operation.
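If you do need the whole content in memory despite that caveat, a sketch of such an fread loop might look like the following; the helper name slurp is made up, and the buffer grows as needed with no prior knowledge of the size:

```c
#include <stdio.h>
#include <stdlib.h>

/* Read all of stream into a malloc'd buffer; returns NULL on failure.
   *lenp receives the number of bytes read. */
char *slurp(FILE *stream, size_t *lenp)
{
    size_t cap = 1 << 20, len = 0;          /* start with 1 MB */
    char *buf = malloc(cap);
    if (!buf) return NULL;

    for (;;) {
        if (len == cap) {                    /* buffer full: double it */
            char *tmp = realloc(buf, cap *= 2);
            if (!tmp) { free(buf); return NULL; }
            buf = tmp;
        }
        size_t n = fread(buf + len, 1, cap - len, stream);
        len += n;
        if (n == 0) break;                   /* end-of-file or error */
    }
    if (ferror(stream)) { free(buf); return NULL; }
    *lenp = len;
    return buf;
}
```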
On current laptop or desktop computers, you might prefer (for efficiency) buffers of a few megabytes, and you can certainly deal with big files of several hundred gigabytes (much larger than your RAM).
On POSIX systems, you might do memory mapped IO with e.g. mmap(2) - but that might not be faster than read(2) with large buffers (of a few megabytes). You could use readahead(2) (Linux specific) and posix_fadvise(2) (or madvise(2) if using mmap) to tune performance by hinting your OS kernel.
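A minimal mmap sketch (error handling trimmed; assumes argv[1] names a regular, non-empty file, and shows an madvise hint as mentioned above):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return 1; }

    /* Map the whole file read-only; the kernel pages it in on demand. */
    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint that we intend to read it sequentially. */
    madvise((void *)data, st.st_size, MADV_SEQUENTIAL);

    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)   /* ... use data[i] ... */
        sum += (unsigned char)data[i];
    printf("%lu\n", sum);

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}
```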
If you have to code for Microsoft Windows, you could study its WinAPI and find some way to do memory mapped IO.
In practice, file data (notably if it was accessed recently) often stays in the page cache, which is of paramount importance for performance. When that is not the case, your hardware (disk, controller, ...) becomes the bottleneck and your program becomes I/O bound (in that case, no software trick can significantly improve performance).

pread/pwrite, buffers and disk cache

If my code does something like fd = open("/dev/sdXY", ...) and pwrite(fd, ...)/pread(fd, ...), do the I/O operations skip the buffers or disk cache? Suppose /dev/sdXY is an unmounted, formatted disk partition (ext4, ufs, etc.).
I ask that because there is a need to guarantee contiguous file storage in an application I'm working on, and I read that the only way to achieve it is by doing something like what I described. However, I may drop the requirement for contiguous storage if it would mean losing buffering, the disk cache, or some other useful feature.
I'm also confused about whether I would need to re-implement low-level stuff, since the partition would already be formatted with a file system. I read that this would be the case for raw disks/partitions. I already know I will need to track which blocks are free or in use, file and folder structures, etc.; I'm already working on that.
Another question: I have only seen something about buffers when reading about fopen()/fread()/fwrite() and C++'s file streams. Is it right that only these streams and the f* family of functions have some kind of buffer, unlike open/write/read/pwrite/pread/etc? Is this buffer the same as disk cache or something different?
A last one: is the HDD cache handled by the drive itself or by the file system (ext4, ufs, etc.)?
The simple answer is 'it depends'. What's hard is characterizing what it depends on.
Simply using open() doesn't avoid the kernel disk buffer pool. To do that, you need special options (O_DIRECT) on Linux. However, using open() does avoid using hidden application buffers; you get to choose where the data is read from or written to without any intermediate copies. By contrast, the f* family of functions do have a 'hidden' application buffer; the data is frequently read into an I/O buffer associated with the FILE * file stream, and then copied into your application buffers.
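As a hedged illustration of that difference: opening the device with O_DIRECT (Linux-specific behaviour; the buffer, offset and length generally must be aligned to the logical sector size) bypasses the kernel page cache, whereas a plain open() does not. The device path and 4 KB alignment below are placeholders:

```c
#define _GNU_SOURCE                        /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Plain open(): reads go through the kernel page cache.
       int fd = open("/dev/sdXY", O_RDONLY);                      */

    /* O_DIRECT: reads bypass the page cache; alignment is required. */
    int fd = open("/dev/sdXY", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) {   /* sector-aligned buffer */
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }

    ssize_t n = pread(fd, buf, 4096, 0);           /* first 4 KB of the device */
    if (n < 0) perror("pread");
    else printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return 0;
}
```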
If your /dev/sdXY device is already formatted with a file system but you want to ensure contiguous file storage for a file, you are going to have to replicate a significant portion of the file system driver to ensure you allocate the space correctly. It is unlikely to be a sensible use of your time or energy. Yes, you would need to reimplement all sorts of low-level disk space management — it would be entirely non-trivial. Further, the implementation for ext4 would be quite different from the implementation for ufs, etc — so you'd really have your work cut out for you.

Reading file using fread in C

I lack formal knowledge in Operating systems and C. My questions are as follows.
When I try to read the first single byte of a file using fread in C, is the entire disk block containing that byte brought into memory, or just the byte?
If the entire block is brought into memory, what happens on reading the second byte, since the block containing that byte is already in memory?
Is there any significance to reading the file in multiples of the disk block size?
Where is the read file block kept in memory?
Here are my answers:
More than just that byte; stdio reads a whole buffer, typically a few kilobytes up to 64 KB depending on the implementation. setvbuf can change that (see the sketch after this answer).
On the second read, there's no disk I/O. The data is read from that buffer.
No; a file is usually smaller than the disk space allocated to it. You'll hit end-of-file reading past the file's size even if you're still within the allocated disk space.
It's part of the FILE structure. This is implementation (compiler) specific so don't touch it.
The above caching is done by the C runtime library, not the OS. The OS may or may not have its own disk caching; that is a separate mechanism.
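A small sketch of changing that stdio buffer with setvbuf (it must be called after fopen and before any other I/O on the stream; the file name and 1 MB size are only examples):

```c
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("data.bin", "rb");   /* file name is a placeholder */
    if (!fp) { perror("fopen"); return 1; }

    /* Give the stream a 1 MB fully-buffered I/O buffer instead of the default. */
    static char iobuf[1 << 20];
    if (setvbuf(fp, iobuf, _IOFBF, sizeof iobuf) != 0)
        fprintf(stderr, "setvbuf failed\n");

    char byte;
    if (fread(&byte, 1, 1, fp) == 1)      /* this single-byte read fills the buffer */
        printf("first byte: 0x%02x\n", (unsigned char)byte);

    fclose(fp);
    return 0;
}
```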

Efficient way to read/write large number of sparse arrays to disk in c

I need to write around 10^3 sparse double arrays to disk (one at a time) and read them individually later in the program.
EDIT: Apologies for not framing the question clearly earlier. To be specific, I am looking to store as much as possible in memory and save the currently unused variables to disk. I am working on Linux.
The fastest way would be to buffer the I/O. Instead of writing each array individually, you'd first copy as many as you can to a buffer. Once that buffer is full you would write the entire buffer to disk, clear the buffer, and repeat. This minimizes the amount of writes that occur to the disk and will increase I/O efficiency.
If you plan on reading the arrays later in sequential order, I recommend you also buffer the reads, so each read pulls in more than it immediately needs and you can work out of the buffer.
You could take it a step further and use asynchronous read/write operations so that your program can process other tasks while waiting on the disk.
If you are concerned about the size on disk it will consume, you can add another layer that will compress/uncompress the data stream as you write/read to and from the disk.
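A rough sketch of the write-side buffering described above; the array size, element type, helper names (write_array, flush_arrays) and buffer geometry are made up for illustration, and a real version would also record which array lives at which file offset so it can be located again:

```c
#include <stdio.h>
#include <string.h>

#define ARRAY_BYTES (4096 * sizeof(double))   /* size of one stored array (assumed) */
#define BUF_ARRAYS  64                        /* arrays accumulated per disk write */

static unsigned char iobuf[BUF_ARRAYS * ARRAY_BYTES];
static size_t filled;                         /* bytes currently queued in iobuf */

/* Queue one array; write the whole buffer to disk only when it is full. */
static int write_array(FILE *fp, const double *arr)
{
    memcpy(iobuf + filled, arr, ARRAY_BYTES);
    filled += ARRAY_BYTES;
    if (filled == sizeof iobuf) {
        if (fwrite(iobuf, 1, filled, fp) != filled) return -1;
        filled = 0;
    }
    return 0;
}

/* Call once at the end to push out whatever is still queued. */
static int flush_arrays(FILE *fp)
{
    if (filled && fwrite(iobuf, 1, filled, fp) != filled) return -1;
    filled = 0;
    return 0;
}
```

The same idea works in reverse for reads: fill a large buffer in one call and hand out arrays from it until it is exhausted.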
The HDF5 data format is meant to write large amount of data to disk efficiently.
This format is used by NASA and a large number of scientific applications :
http://www.hdfgroup.org/HDF5/
http://en.wikipedia.org/wiki/Hierarchical_Data_Format

Does "opening a file" mean loading it completely into memory?

There's an AudioFileOpenURL function which opens a file. With AudioFileReadPackets that file is accessed to read packets. But one thing that sticks in my brain is: does AudioFileOpenURL actually load the whole monster into memory? Or is that a lightweight operation?
So is it possible to read data from a file, only a specific portion, without having the whole terabytes of stuff in memory?
Does AudioFileOpenURL actually load the whole monster into memory?
No, it just gets you a file pointer.
Or is that a lightweight operation?
Yep, fairly lightweight. Just requires a filesystem lookup.
So is it possible to read data from a file, only a specific portion, without having the whole terabytes of stuff in memory?
Yes, you can use fseek to go to a certain point in the file, then fread to read it into a buffer (or AudioFileReadBytes).
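A minimal sketch of that in plain C (the file name, offset and length are arbitrary placeholders):

```c
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("huge.dat", "rb");              /* file name is a placeholder */
    if (!fp) { perror("fopen"); return 1; }

    char portion[4096];
    if (fseek(fp, 1024L * 1024L, SEEK_SET) != 0) {   /* jump to a 1 MB offset */
        perror("fseek");
        fclose(fp);
        return 1;
    }
    size_t n = fread(portion, 1, sizeof portion, fp); /* read just 4 KB of it */
    printf("read %zu bytes from the middle of the file\n", n);

    fclose(fp);
    return 0;
}
```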
No, it doesn't load the entire file into memory. "Opens a file" returns a handle to you allowing you to read from or write to a file.
I don't know about objective-c, but with most languages you open the file, and that just gives you the ability to THEN access the contents with a READ operation. In your case, you can perform a SEEK to move the file pointer to the desired location, then read the number of bytes you need.
AudioFileOpenURL will open(2) the file and read the necessary info (4096 bytes) to determine the audio type.
open(2) won't load the whole file into RAM.
(AudioFileOpenURL is a C API, not Objective-C.)
