C: Reading large files with limited memory - c

I am working on something that requires reading from and writing to a large file (or equivalent) but is allowed fairly minimal memory to do it (I don't have the exact spec, but let's call the "large" 15GB and the "minimal" 16K). The file is accessed randomly, usually in chunks of 512 Bytes and it is guaranteed that sometimes consecutive reads will be significant distance apart - possibly literally opposite ends of the disk (or a small number of MB from either end). Currently I'm using pread/pwrite to hit the locations I want in the file (I was previously using fseek, but abandoned it in favor of p(wread|write) because reasons.
Accessing the file this way is (perhaps obviously) slow, and I'm looking for ways to optimise/speed up the performance as much as possible (with as limited use (read: NO) as possible of external libraries).
I don't mean to be too cagey about exactly what we're doing, so it might help to think of it as a driver for a file system. At one end of the disk we're accessing the file and directory tables, and at the other raw data - so we need to write file information and then skiup to the data. But even within such zones don't assume anything about the layout. There is no guarantee that multiple files (or even multiple chunks of a single file) will be stored contiguously - or even close together. This also means that we can't make assumptions about the order that data will be read.
A couple of things I have considered include:
Opening Multiple File Descriptors for different parts of the file (but I'm not sure there's any state associated with the FD and whether this would even have an impact)
A few smarts around caching data that I expect to be accessed several times in a short amount of time
I was wondering whether others might have been in a similar boat and/or have opinions (or articles they can link) that discuss different strategies to minimise the impact of reading.
I guess I was always wondering whether pread is even the right choice in this situation....
Any thoughts/opinions/criticisms/etc more than welcome.
NOTE: The program will always run in a single thread (so options don't need to be thread-safe, but equally pushing the read to the background isn't an option either).

Related

Is saving a binary file a standard? Is it limited to only 1 type?

When should a programmer use .bin files? (practical examples).
Is it popular (or accepted) to save different data types in one file?
When iterating over the data in a file (that has several data types), the program must know the exact length of every data type, and I find that limiting.
If you mean for some idealized general purpose application data, text files are often preferred because they provide transparency to the user, and might also make it easier to (for instance) move the data to a different application and avoid lock-in.
Binary files are mostly used for performance and compactness reasons, encoding things as text has non-trivial overhead in both of these departments (today, perhaps mostly in size) which sometimes are prohibitive.
Binary files are used whenever compactness or speed of reading/writing are required.
Those two requirements are closely related in the obvious way that reading and writing small files is fast, but there's one other important reason that binary I/O can be fast: when the records have fixed length, that makes random access to records in the file much easier and faster.
As an example, suppose you want to do a binary search within the records of a file (they'd have to be sorted, of course), without loading the entire file to memory (maybe because the file is so large that it doesn't fit in RAM). That can be done efficiently only when you know how to compute the offset of the "midpoint" between two records, without having to parse arbitrarily large parts of a file just to find out where a record starts or ends.
(As noted in the comments, random access can be achieved with text files as well; it's just usually harder to implement and slower.)
I think when embedded developers see a ".bin" file, it's generally a flattened version of an ELF or the like, intended for programming as firmware for a processor. For instance, putting the Linux kernel into flash (depending on your bootloader).
As a general practice of whether or not to use binary files, you see it done for many reasons. Text requires parsing, and that can be a great deal of overhead. If it's intended to be usable by the user though, binary is a poor format, and text really shines.
Where binary is best is for performance. You can do things like map it into memory, and take advantage of the structure to speed up access. Sometimes, you'll have two binary files, one with data, and one with metadata, that can be used to help with searching through gobs of data. For example, Git does this. It defines an index format, a pack format, and an object format that all work together to save the history of your project is a readily accessible, but compact way.

random reading on very large files with fgets seems to bring Windows caching at it's limits

I have written a C/C++-program for Windows 7 - 64bit that works on very large files. In the final step it reads lines from an input-file (10GB+) and writes them to an output file. The access to the input-file is random, the writing is sequential.
EDIT: Main reason for this approach is to reduce RAM usage.
What I basically do in the reading part is this: (Sorry, very shortened and maybe buggy)
void seekAndGetLine(char* line, size_t lineSize, off64_t pos, FILE* filePointer){
fseeko64(filePointer, pos, ios_base::beg);
fgets(line, lineSize, filePointer);
}
Normally this code is fine, not to say fast, but under some very special conditions it gets very slow. The behaviour doesn't seem to be deterministic, since the performance drops occure on different machines at other parts of the file or even don't occure at all. It even goes so far, that the program totally stops reading, while there are no disc-operations.
Another sympthom seems to be the used RAM. My process keeps it's RAM steady, but the RAM used by the System grows sometimes very large. After using some RAM-Tools I found out, that the Windows Mapped File grows into several GBs. This behaviour also seems to depend on the hardware, since it occure on different machines at different parts of the process.
As far as I can tell, this problem doesn't exist on SSDs, so it definitely has something to do with the responsetime of the HDD.
My guess is that the Windows Caching gets somehow "wierd". The program is fast as long as the cache does it's work. But when Caching goes wrong, the behaviour goes either into "stop reading" or "grow cache size" and sometimes even both. Since I'm no expert for the windows caching algorithms, I would be happy to hear an explanation. Also, is there any way to get Windows out of C/C++ to manipulate/stop/enforce the caching.
Since I'm hunting this problem for a while now, I've already tried some tricks, that didn't work out:
filePointer = fopen(fileName, "rbR"); //Just fills the cache till the RAM is full
massive buffering of the read/write, to stop getting the two into each others way
Thanks in advance
Truly random access across a huge file is the worst possible case for any cache algorithm. It may be best to turn off as much caching as possible.
There are multiple levels of caching:
the CRT library (since you're using the f- functions)
the OS and filesystem
probably onboard the drive itself
If you replace your I/O calls via the f- functions in the CRT with the comparable ones in the Windows API (e.g., CreateFile, ReadFile, etc.) you can eliminate the CRT caching, which may be doing more harm than good. You can also warn the OS that you're going to be doing random accesses, which affects its caching strategy. See options like FILE_FLAG_RANDOM_ACCESS and possibly FILE_FLAG_NO_BUFFERING.
You'll need to experiment and measure.
You might also have to reconsider how your algorithm works. Are the seeks truly random? Can you re-sequence them, perhaps in batches, so that they're in order? Can you limit access to a relatively small region of the file at a time? Can you break the huge file into smaller files and then work with one piece at a time? Have you checked the level of fragmentation on the drive and on the particular file?
Depending on the larger picture of what your application does, you could possibly take a different approach - maybe something like this:
decide which lines you need from the input file and store the
line numbers in a list
sort the list of line numbers
read through the input file once, in order, and pull out the lines
you need (better yet, seek to next line and grab it, especially when there's big gaps)
if the list of lines you're grabbing is small enough, you can store
them in memory for reordering before output, otherwise, stick them
in a smaller temporary file and use that file as input for your
current algorithm to reorder the lines for final output
It's definitely a more complex approach, but it would be much kinder to your caching subsystem, and as a result, could potentially perform significantly better.

Performance issues in writing to large files?

I have been recently involved in handling the console logs for a server and I was wondering, out of curiosity, that is there a performance issue in writing to a large file as compared to small ones.
For instance is it a good idea to keep the log file size small instead of letting them grow bulky, but I was not able to argue much in favor of either approach.
There might be problems in reading or searching in the file, but right now I am more interested in knowing if writing can be affected in any way.
Looking for an expert advice.
Edit:
The way I thought it was that the OS only has to open a file handle and push the data to the file system. There is little correlation to the file size, since you have to keep on appending the data to the end of the file and whenever a block of data is full, OS will assign another block to the file. As I said earlier, there can be problems in reading and searching because of defragmentation of file blocks, but I could not find much difference while writing.
As a general rule, there should be no practical difference between appending a block to a small file (or writing the first block which is appending to a zero-length file) or appending a block to a large file.
There are special cases (like trying to fault in a triple-indirect block or the initial open having to read all mapping information) which could add additional I/O's. but the steady-state should be the same.
I'd be more worried about the manageability of having huge files: slow to backup, slow to copy, slow to view, etc.
I am not an expert, but I will try to answer anyway.
Larger files may take longer to write on disk and in fact it is not a programming issue. It is file system issue. Perhaps there are file systems, which does not have such issues, but on Windows large files cannot be write down in one piece so fragmenting them will take time (for the simple reason that head will have to move to some other cylinder). Assuming that we are talking about "classic" hard drives...
If you want an advice, I would go for writing down smaller files and rotating them either daily or when they hit some size (or both actually). That is rather common approach I saw in an enterprise-grade products.

Is it possible to delete both ends of a large file without copying?

I would like to know if it is possible, using Windows and c++, to take a large video file (several gigabytes in length) and delete the first and last few hundred megabytes of it “in-place”.
The traditional approach of copying the useful data to a new file often takes upwards of 20 minutes of seemingly needless copying.
Is there anything clever that can be done low-level with the disk to make this happen?
Sure, it's possible in theory. But if your filesystem is NTFS, be prepared to spend a few months learning about all the data structures that you'll need to update. (All of which are officially undocumented BTW.)
Also, you'll need to either
Somehow unmount the volume and make your changes then; or
Learn how to write a kernel filesystem driver, buy a license from MS, develop the driver and use it to make changes to a live filesystem.
It's a bit easier if your filesystem is something simpler like FAT32. But either way: in short, it might be possible, but even if it is it'll take years out of your life. My advice: don't bother.
Instead, look at other ways you could solve the problem: e.g. by using an avisynth script to serve just the frames from the region you are interested in.
Are you hoping to just fiddle around with sector addresses in the directory entry? It's virtually inconceivable that plan would work.
First of all, it would require that the amount of data you wish to delete be exactly a sector size. That's not very likely considering that there is probably some header data at the very start that must remain there.
Even if it mets those requirements, it would take a low-level modification, which Windows tries very hard to prevent you from doing.
Maybe your file format allows to 'skip' the bytes, so that you could simply write over (i.e. with memory mapping) the necessary parts. This would of course still use up unnecessarily much disk space.
Yes, you can do this, on NTFS.
The end you remove with SetFileLength.
The beginning, or any other large consecutive region of the file, you overwrite with zeros. You then mark the file "sparse", which allows the file system to reclaim those clusters.
Note that this won't actually change the offset of the data relative to the beginning of the file, it only prevents the filesystem from wasting space storing unneeded data.
Even if low level filesystem operations were easy, editing a video file is not simply a matter of deleting unwanted megabytes. You still do have to consider concepts such as compression, frames, audio and video muxing, media file containers, and many others...
Your best solution is to simply accept your idle twenty minutes.

Truncate file at front

A problem I was working on recently got me to wishing that I could lop off the front of a file. Kind of like a “truncate at front,” if you will. Truncating a file at the back end is a common operation–something we do without even thinking much about it. But lopping off the front of a file? Sounds ridiculous at first, but only because we’ve been trained to think that it’s impossible. But a lop operation could be useful in some situations.
A simple example (certainly not the only or necessarily the best example) is a FIFO queue. You’re adding new items to the end of the file and pulling items out of the file from the front. The file grows over time and there’s a huge empty space at the front. With current file systems, there are several ways around this problem:
As each item is removed, copy the
remaining items up to replace it, and
truncate the file. Although it works,
this solution is very expensive
time-wise.
Monitor the size of the empty space at
the front, and when it reaches a
particular size or percentage of the
entire file size, move everything up
and truncate the file. This is much
more efficient than the previous
solution, but still costs time when
items are moved in the file.
Implement a circular queue in the
file, adding new items to the hole at
the front of the file as items are
removed. This can be quite efficient,
especially if you don’t mind the
possibility of things getting out of
order in the queue. If you do care
about order, there’s the potential of
having to move items around. But in
general, a circular queue is pretty
easy to implement and manages disk
space well.
But if there was a lop operation, removing an item from the queue would be as easy as updating the beginning-of-file marker. As easy, in fact, as truncating a file. Why, then, is there no such operation?
I understand a bit about file systems implementation, and don't see any particular reason this would be difficult. It looks to me like all it would require is another word (dword, perhaps?) per allocation entry to say where the file starts within the block. With 1 terabyte drives under $100 US, it seems like a pretty small price to pay for such functionality.
What other tasks would be made easier if you could lop off the front of a file as efficiently as you can truncate at the end?
Can you think of any technical reason this function couldn't be added to a modern file system? Other, non-technical reasons?
On file systems that support sparse files "punching" a hole and removing data at an arbitrary file position is very easy. The operating system just has to mark the corresponding blocks as "not allocated". Removing data from the beginning of a file is just a special case of this operation. The main thing that is required is a system call that will implement such an operation: ftruncate2(int fd, off_t offset, size_t count).
On Linux systems this is actually implemented with the fallocate system call by specifying the FALLOC_FL_PUNCH_HOLE flag to zero-out a range and the FALLOC_FL_COLLAPSE_RANGE flag to completely remove the data in that range. Note that there are restrictions on what ranges can be specified and that not all filesystems support these operations.
Truncate files at front seems not too hard to implement at system level.
But there are issues.
The first one is at programming level. When opening file in random access the current paradigm is to use offset from the beginning of the file to point out different places in the file. If we truncate at beginning of file (or perform insertion or removal from the middle of the file) that is not any more a stable property. (While appendind or truncating from the end is not a problem).
In other words truncating the beginning would change the only reference point and that is bad.
At a system level uses exist as you pointed out, but are quite rare. I believe most uses of files are of the write once read many kind, so even truncate is not a critical feature and we could probably do without it (well some things would become more difficult, but nothing would become impossible).
If we want more complex accesses (and there are indeed needs) we open files in random mode and add some internal data structure. Theses informations can also be shared between several files. This leads us to the last issue I see, probably the most important.
In a sense when we using random access files with some internal structure... we are still using files but we are not any more using files paradigm. Typical such cases are the databases where we want to perform insertion or removal of records without caring at all about their physical place. Databases can use files as low level implementation but for optimisation purposes some database editors choose to completely bypass filesystem (think about Oracle partitions).
I see no technical reason why we couldn't do everything that is currently done in an operating system with files using a database as data storage layer. I even heard that NTFS has many common points with databases in it's internals. An operating system can (and probably will in some not so far future) use another paradigm than files one.
Summarily i believe that's not a technical problem at all, just a change of paradigm and that removing the beginning is definitely not part of the current "files paradigm", but not a big and useful enough change to compell changing anything at all.
NTFS can do something like this with it's sparse file support but it's generaly not that useful.
I think there's a bit of a chicken-and-egg problem in there: because filesystems have not supported this kind of behavior efficiently, people haven't written programs to use it, and because people haven't written programs to use it, there's little incentive for filesystems to support it.
You could always write your own filesystem to do this, or maybe modify an existing one (although filesystems used "in the wild" are probably pretty complicated, you might have an easier time starting from scratch). If people find it useful enough it might catch on ;-)
Actually there are record base file systems - IBM have one and I believe DEC VMS also had this facility. I seem to remember both allowed (allow? I guess they are still around) deleting and inserting at random positions in a file.
There is also a unix command called head -- so you could do this via:
head -n1000 file > file_truncated
may can achieve this goal in two steps
long fileLength; //file total length
long reserveLength; //reserve length until the file ending
int fd; //file open for read & write
sendfile(fd, fd, fileLength-reserveLength, reserveLength);
ftruncate(fd, reserveLength);

Resources