Loading thousand of files in same memory chunk C - c

Box: Linux, gcc.
Problem :
Finding out the file signature of an home folder, which contains thousand of items, by scanning this folder recursively.
Done so far:
Using mmap() system call to load the first 1k byte of each file and check the file magic number.
The drawback with that method is that for each file encountered i've got to make two system calls (e.g mmap() and munmap()).
Best solution if possible:
I would like to allocate a single chunk of memory, load each file (in a row) in this unique buffer and when processing is completed deallocate it, meaning that for each folder scanned i would only use two system calls.
I can't figure out which system call to use in order to achieve that, not even if this solution is realistic!
Any advice would be greatly appreciated.

Don't worry about performance until you know it isn't enough. Your time is much more valuable than the gains in program run time (except for extremely rare cases). And when the performance isn't enough, measure before digging in. There are numerous war stories on "performance optimizations" that were a complete waste (if not positivley harmful).

Related

Fortran: How do I allocate arrays when reading a file of unknown size?

My typical use of Fortran begins with reading in a file of unknown size (usually 5-100MB). My current approach to array allocation involves reading the file twice. First to determine the size of the problem (to allocate arrays) and a second time to read the data into those arrays.
Are there better approaches to size determination/array allocation? I just read about automatic array allocation (example below) in another post that seemed much easier.
array = [array,new_data]
What are all the options and their pros and cons?
I'll bite, though the question is teetering close to off-topicality. Your options are:
Read the file once to get the array size, allocate, read again.
Read piece-by-piece, (re-)allocating as you go. Choose the size of piece to read as you wish (or, perhaps, as you think is likely to be most speedy for your case).
Always, always, work with files which contain metadata to tell an interested program how much data there is; for example a block
header line telling you how many data elements are in the next
block.
Option 3 is the best by far. A little extra thought, and about one whole line of code, at the beginning of a project and so much wasted time and effort saved down the line. You don't have to jump on HDF5 or a similar heavyweight file design method, just adopt enough discipline to last the useful life of the contents of the file. For iteration-by-iteration dumps from your simulation of the universe, a home-brewed approach will do (be honest, you're the only person who's ever going to look at them). For data gathered at an approximate cost of $1M per TB (satellite observations, offshore seismic traces, etc) then HDF5 or something similar.
Option 1 is fine too. It's not like you have to wait for the tapes to rewind between reads any more. (Well, some do, but they're in a niche these days, and a de-archiving system will often move files from tape to disk if they're to be used.)
Option 2 is a faff. It may also be the worst performing but on all but the largest files the worst performance may be within a nano-century of the best. If that's important to you then check it out.
If you want quantification of my opinions run your own experiments on your files on your hardware.
PS I haven't really got a clue how much it costs to get 1TB of satellite or seismic data, it's a factoid invented to support an argument.
I would add to the previous answer:
If your data has a regular structure and it's possible to open it in a txt file, press ctrl+end substract header to the rows total and there it is. Although you may waste time opening it if it's very large.

random reading on very large files with fgets seems to bring Windows caching at it's limits

I have written a C/C++-program for Windows 7 - 64bit that works on very large files. In the final step it reads lines from an input-file (10GB+) and writes them to an output file. The access to the input-file is random, the writing is sequential.
EDIT: Main reason for this approach is to reduce RAM usage.
What I basically do in the reading part is this: (Sorry, very shortened and maybe buggy)
void seekAndGetLine(char* line, size_t lineSize, off64_t pos, FILE* filePointer){
fseeko64(filePointer, pos, ios_base::beg);
fgets(line, lineSize, filePointer);
}
Normally this code is fine, not to say fast, but under some very special conditions it gets very slow. The behaviour doesn't seem to be deterministic, since the performance drops occure on different machines at other parts of the file or even don't occure at all. It even goes so far, that the program totally stops reading, while there are no disc-operations.
Another sympthom seems to be the used RAM. My process keeps it's RAM steady, but the RAM used by the System grows sometimes very large. After using some RAM-Tools I found out, that the Windows Mapped File grows into several GBs. This behaviour also seems to depend on the hardware, since it occure on different machines at different parts of the process.
As far as I can tell, this problem doesn't exist on SSDs, so it definitely has something to do with the responsetime of the HDD.
My guess is that the Windows Caching gets somehow "wierd". The program is fast as long as the cache does it's work. But when Caching goes wrong, the behaviour goes either into "stop reading" or "grow cache size" and sometimes even both. Since I'm no expert for the windows caching algorithms, I would be happy to hear an explanation. Also, is there any way to get Windows out of C/C++ to manipulate/stop/enforce the caching.
Since I'm hunting this problem for a while now, I've already tried some tricks, that didn't work out:
filePointer = fopen(fileName, "rbR"); //Just fills the cache till the RAM is full
massive buffering of the read/write, to stop getting the two into each others way
Thanks in advance
Truly random access across a huge file is the worst possible case for any cache algorithm. It may be best to turn off as much caching as possible.
There are multiple levels of caching:
the CRT library (since you're using the f- functions)
the OS and filesystem
probably onboard the drive itself
If you replace your I/O calls via the f- functions in the CRT with the comparable ones in the Windows API (e.g., CreateFile, ReadFile, etc.) you can eliminate the CRT caching, which may be doing more harm than good. You can also warn the OS that you're going to be doing random accesses, which affects its caching strategy. See options like FILE_FLAG_RANDOM_ACCESS and possibly FILE_FLAG_NO_BUFFERING.
You'll need to experiment and measure.
You might also have to reconsider how your algorithm works. Are the seeks truly random? Can you re-sequence them, perhaps in batches, so that they're in order? Can you limit access to a relatively small region of the file at a time? Can you break the huge file into smaller files and then work with one piece at a time? Have you checked the level of fragmentation on the drive and on the particular file?
Depending on the larger picture of what your application does, you could possibly take a different approach - maybe something like this:
decide which lines you need from the input file and store the
line numbers in a list
sort the list of line numbers
read through the input file once, in order, and pull out the lines
you need (better yet, seek to next line and grab it, especially when there's big gaps)
if the list of lines you're grabbing is small enough, you can store
them in memory for reordering before output, otherwise, stick them
in a smaller temporary file and use that file as input for your
current algorithm to reorder the lines for final output
It's definitely a more complex approach, but it would be much kinder to your caching subsystem, and as a result, could potentially perform significantly better.

How do file systems track free space

I'm writing a program that needs to read and write lots of data in random order, and since I don't want to use hundreds of small files, I'm trying to develop a sort of virtual file system that writes to one large file that keeps track of where the "files" are in the "disk" file.
Thus, I've been trying to find detailed information about file system implementations, but theres one thing that never seems to be explained in a way I can understand: How does the file system track free/deleted sectors for new file creation? FAT, for example, has an index of all sectors at the beginning that seems to be the only place that holds this information, but searching the index for a new area of free space in a linear, O(n) fashion seems like it would be rather inefficient, especially if there are no deleted sectors and you have to insert something at the end of the list. Am I missing something, or is this is how file systems really detect unused sectors for writing? Thanks!
The answer depends on the overall file system architecture. It can be a linear list of free pages, or the free space can be counted in the same way as other files (eg. linked lists).
Practically developing an efficient file system is quite serious task for a side task that you have. So it makes sense to use some already created virtual file system, such as the one CodeBase offers or our Solid File System.
I found a helpful PDF explaining how free space is mapped in linux file systems. This is more along the lines of what I was looking for.
http://www.kernel.org/doc/ols/2010/ols2010-pages-121-132.pdf
it's just like a link list: each file may be seperated into many partitions and at the end of the partition each partition it's refering to begining of the next the same goes for the free speaces. think of free space as one large file that contains every byte which is not inside another file!

Performance issues in writing to large files?

I have been recently involved in handling the console logs for a server and I was wondering, out of curiosity, that is there a performance issue in writing to a large file as compared to small ones.
For instance is it a good idea to keep the log file size small instead of letting them grow bulky, but I was not able to argue much in favor of either approach.
There might be problems in reading or searching in the file, but right now I am more interested in knowing if writing can be affected in any way.
Looking for an expert advice.
Edit:
The way I thought it was that the OS only has to open a file handle and push the data to the file system. There is little correlation to the file size, since you have to keep on appending the data to the end of the file and whenever a block of data is full, OS will assign another block to the file. As I said earlier, there can be problems in reading and searching because of defragmentation of file blocks, but I could not find much difference while writing.
As a general rule, there should be no practical difference between appending a block to a small file (or writing the first block which is appending to a zero-length file) or appending a block to a large file.
There are special cases (like trying to fault in a triple-indirect block or the initial open having to read all mapping information) which could add additional I/O's. but the steady-state should be the same.
I'd be more worried about the manageability of having huge files: slow to backup, slow to copy, slow to view, etc.
I am not an expert, but I will try to answer anyway.
Larger files may take longer to write on disk and in fact it is not a programming issue. It is file system issue. Perhaps there are file systems, which does not have such issues, but on Windows large files cannot be write down in one piece so fragmenting them will take time (for the simple reason that head will have to move to some other cylinder). Assuming that we are talking about "classic" hard drives...
If you want an advice, I would go for writing down smaller files and rotating them either daily or when they hit some size (or both actually). That is rather common approach I saw in an enterprise-grade products.

Is it possible to delete both ends of a large file without copying?

I would like to know if it is possible, using Windows and c++, to take a large video file (several gigabytes in length) and delete the first and last few hundred megabytes of it “in-place”.
The traditional approach of copying the useful data to a new file often takes upwards of 20 minutes of seemingly needless copying.
Is there anything clever that can be done low-level with the disk to make this happen?
Sure, it's possible in theory. But if your filesystem is NTFS, be prepared to spend a few months learning about all the data structures that you'll need to update. (All of which are officially undocumented BTW.)
Also, you'll need to either
Somehow unmount the volume and make your changes then; or
Learn how to write a kernel filesystem driver, buy a license from MS, develop the driver and use it to make changes to a live filesystem.
It's a bit easier if your filesystem is something simpler like FAT32. But either way: in short, it might be possible, but even if it is it'll take years out of your life. My advice: don't bother.
Instead, look at other ways you could solve the problem: e.g. by using an avisynth script to serve just the frames from the region you are interested in.
Are you hoping to just fiddle around with sector addresses in the directory entry? It's virtually inconceivable that plan would work.
First of all, it would require that the amount of data you wish to delete be exactly a sector size. That's not very likely considering that there is probably some header data at the very start that must remain there.
Even if it mets those requirements, it would take a low-level modification, which Windows tries very hard to prevent you from doing.
Maybe your file format allows to 'skip' the bytes, so that you could simply write over (i.e. with memory mapping) the necessary parts. This would of course still use up unnecessarily much disk space.
Yes, you can do this, on NTFS.
The end you remove with SetFileLength.
The beginning, or any other large consecutive region of the file, you overwrite with zeros. You then mark the file "sparse", which allows the file system to reclaim those clusters.
Note that this won't actually change the offset of the data relative to the beginning of the file, it only prevents the filesystem from wasting space storing unneeded data.
Even if low level filesystem operations were easy, editing a video file is not simply a matter of deleting unwanted megabytes. You still do have to consider concepts such as compression, frames, audio and video muxing, media file containers, and many others...
Your best solution is to simply accept your idle twenty minutes.

Resources