Selecting a file randomly from a file system - C

This question relates to Simulating file system access.
I need to choose files and directories randomly to act as arguments of file operations like rename, write, read, etc. What I was planning to do was to make a list of all files and directories with their paths and randomly select from this list. But, as files and directories are created and deleted in the actual file system, the list also has to be updated. I am finding maintaining and updating the list in this manner to be inefficient, and the updates also have to be atomic so that a later operation does not access a file that was deleted by a previous operation.
Can you suggest a different way of selecting the files? Maybe some way to do it directly from the file system... but how would we know the paths to the files then?
Thanks
I found something interesting here: Randomly selecting a file from a tree of directories in a completely fair manner, especially in Michael J. Barber's answer, but I was not able to follow it completely due to my ignorance of Python.

You don't want to try to maintain a list of files when the filesystem is right there. You should be able to do it right from C. Walk from the root directory, selecting a random entry from it. You can pick a random maximum depth, and if you hit a regular file at or before that depth, use it. If it's a directory, repeat up to the max depth. If it's a special file, maybe start over.
This should be pretty quick. The operation shouldn't have to be atomic. If the file's not there when you want to do your operation, try again. It shouldn't be too complicated. You can build the path up as you find your target file. This will be simpler than fussing with the fs directly (I assume you meant at a much lower level) and should be simple to implement.
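A minimal, untested sketch of that walk, using POSIX opendir/readdir (the function name, the two-pass entry counting and the retry-on-failure policy are my own illustrative choices, not anything from the answer above):

/*
 * Descend from a root directory, picking a random entry at each level,
 * until a regular file is hit or a maximum depth is reached.
 * Error handling is minimal; this only illustrates the approach.
 */
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Returns 0 and fills 'out' with a path to a regular file, or -1. */
static int pick_random_file(const char *root, int max_depth,
                            char *out, size_t outlen)
{
    char cur[PATH_MAX];
    snprintf(cur, sizeof cur, "%s", root);

    for (int depth = 0; depth < max_depth; depth++) {
        DIR *d = opendir(cur);
        if (!d)
            return -1;

        /* First pass: count usable entries. */
        struct dirent *e;
        long n = 0;
        while ((e = readdir(d)) != NULL)
            if (strcmp(e->d_name, ".") && strcmp(e->d_name, ".."))
                n++;
        if (n == 0) { closedir(d); return -1; }

        /* Second pass: stop at a randomly chosen entry. */
        long target = rand() % n;
        rewinddir(d);
        while ((e = readdir(d)) != NULL) {
            if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
                continue;
            if (target-- == 0)
                break;
        }
        snprintf(out, outlen, "%s/%s", cur, e->d_name);
        closedir(d);

        struct stat st;
        if (lstat(out, &st) != 0)
            return -1;
        if (S_ISREG(st.st_mode))
            return 0;                          /* found a regular file */
        if (!S_ISDIR(st.st_mode))
            return -1;                         /* special file: caller retries */
        snprintf(cur, sizeof cur, "%s", out);  /* descend into directory */
    }
    return -1;                                 /* ran out of depth: caller retries */
}

A caller would seed the RNG once (e.g. srand(time(NULL))) and simply call again whenever the function returns -1. Note that this walk is not uniformly fair: files in small or shallow directories are more likely to be picked, which is exactly the issue the tree-based answer below addresses.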

Here is my proposed solution. It is not the fastest, but it should be quick (after preparation), use only modest memory, and be fairly well distributed. It is, of course, 100% untested and somewhat complex (about as complex as maintaining an RB-tree or similar, anyway) -- I pity anyone having to use C ;-)
1. For each directory in the target domain, build a directory tree using a depth-first walk of the filesystem and record the "before" file count (files found to date, in tree order) and the "after" file count (the "before" count plus the number of files in the directory). It should not store the files themselves. Fast way to find the number of files gives some example C code. It still requires iterating over the directory contents but does not need to store the files themselves.
2. Count up the total number of files in the tree. (This should really just be the "after" count of the last node in the tree.)
3. Pick a random number in the range [0, max files).
4. Navigate to the node in the tree such that "before" file count <= random number < "after" file count. This is just walking down the (RB-)tree structure and is itself O(lg n) or similar.
5. Pick a random file in the directory associated with the selected node -- make sure to count the directory again and use this as the [0, limit) in the selection (with a fallback in case of running off the end due to concurrency issues). If the number of files has changed, make sure to update the tree with that information. Also update/fix the tree if the directory has been deleted, etc. (The extra full count here shouldn't be as bad as it sounds, since readdir must, on average, already walk through half the entries in the directory. However, the benefit of the re-count, if any, should be explored.) A rough sketch of steps 2-5 follows after this list.
6. Repeat steps 2-5 as needed.
7. Periodically rebuild the entire tree (step 1) to account for filesystem changes. Deleting/adding files will slowly skew the randomness -- step 5 can help update the tree in certain situations. The frequency of rebuild should be determined through experimentation. It might also be possible to reduce the error introduced by rebuilding only the parent/grandparent nodes or [random] child nodes on each pass, etc. Using the modified time as a fast way to detect changes may also be worth looking into.
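The promised rough sketch of steps 2-5, replacing the RB-tree with a flat array of per-directory nodes kept in walk order (struct dir_node, find_node and pick_file are illustrative names; step 1, the walk that fills in the counts, is omitted, and the whole thing is untested):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

struct dir_node {
    char path[4096];
    long before;   /* files seen before this directory in the walk */
    long after;    /* before + number of files in this directory   */
};

/* Step 4: binary search for the node whose [before, after) range
 * contains the random index. Assumes nodes are in walk order. */
static struct dir_node *find_node(struct dir_node *nodes, size_t n, long r)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (r < nodes[mid].before)       hi = mid;
        else if (r >= nodes[mid].after)  lo = mid + 1;
        else                             return &nodes[mid];
    }
    return NULL;
}

/* Steps 2-5: pick a file index over the whole tree, find its
 * directory, then re-count and pick a concrete entry in it. */
static int pick_file(struct dir_node *nodes, size_t n,
                     char *out, size_t outlen)
{
    long total = n ? nodes[n - 1].after : 0;        /* step 2 */
    if (total == 0)
        return -1;

    long r = rand() % total;                        /* step 3 */
    struct dir_node *node = find_node(nodes, n, r); /* step 4 */
    if (!node)
        return -1;

    /* Step 5: re-count the directory, then choose a file in it.
     * d_type is not available on every filesystem; a portable
     * version would stat() each entry instead. */
    DIR *d = opendir(node->path);
    if (!d)
        return -1;       /* directory gone: caller should fix the tree */
    long count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (e->d_type == DT_REG)
            count++;
    if (count == 0) { closedir(d); return -1; }

    long target = rand() % count;
    rewinddir(d);
    while ((e = readdir(d)) != NULL)
        if (e->d_type == DT_REG && target-- == 0)
            break;
    if (!e) { closedir(d); return -1; }             /* concurrent change */
    snprintf(out, outlen, "%s/%s", node->path, e->d_name);
    closedir(d);
    return 0;
}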
Happy coding.

All you need to know is how many files are in each directory in order to pick the directory into which to descend. Avoid traversing symbolic links and counting files behind symbolic links.
You can use a solution similar to the one pst described.
Example: you have 3 directories with 20, 40 and 1000 files in each.
You build the cumulative totals [20, 60, 1060] and pick a random number in [0, 1060). If this number is greater than or equal to 60, you go to the 3rd folder.
You stop traversing once you reach a folder without subfolders.
To find a random file along this path you can apply the same trick as before.
This way you will pick any file with equal probability.
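A tiny, untested sketch of that cumulative-count pick, hard-coding the 20/40/1000 example from above:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long counts[] = { 20, 40, 1000 };
    long cumulative[3], total = 0;
    for (int i = 0; i < 3; i++)
        cumulative[i] = (total += counts[i]);   /* [20, 60, 1060] */

    long r = rand() % total;                    /* random number in [0, 1060) */
    int dir = 0;
    while (r >= cumulative[dir])                /* >= 60 -> 3rd folder, etc. */
        dir++;
    printf("descend into directory %d\n", dir + 1);
    return 0;
}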

edit all files via multi-threading in C

If you had a base file directory with an unknown number of files and additional folders with files in them, and you needed to rename every file to append the date it was created,
i.e. filename.ext -> filename_09_30_2021.ext
Assuming the renaming function was already created and returned 1 on success, 0 on failure and -1 on error:
int rename_file(char *filename)
I'm having trouble understanding how you would write the multi-threaded file-parsing section to increase the speed.
Would it have to first break down the entire file tree into, say, 4 parts of char arrays with filenames and then create 4 threads to tackle each section?
Wouldn't that be counterproductive and slower than a single thread going down the file tree and renaming files as it finds them, instead of listing them first for multi-threading?
In general, you get better performance from multithreading for CPU-intensive operations. In this case you'll probably see little to no improvement; it's even quite possible that it gets slower.
The bottleneck here is not the CPU. It's reading from the disk.
Related: An answer I wrote about access times in general https://stackoverflow.com/a/45819202/6699433

Why is ".git/objects/17" special? Why does the Git source code describe this directory as "insanely long"?

In this source code,
https://github.com/git/git/commit/07af88913662f1179ba34b92370a6df24263ae5f
if (sizeof(path) <= snprintf(path, sizeof(path), "%s/17", objdir)) {
	warning(_("insanely long object directory %.*s"), 50, objdir);
	return 0;
}
dir = opendir(path);
The committer says:
We probe the "17/" loose object directory for auto-gc, and
use a local buffer to format the path
But why?
I searched a lot, but I cannot understand it.
Git uses SHA-1 or SHA-256 to identify objects, and these hashes are designed to be indistinguishable from random. Therefore, the number of objects that are stored in one directory is, on average, 1/256th of the total number of objects. It would be inefficient to traverse all possible objects, so as a result, Git picks just one directory, reads the number of objects, and extrapolates by multiplying by 256.
Any option here is fine, and picking a fixed value simplifies the tests very significantly, since they can then be deterministic. The Git developers (and users in general) like deterministic behavior because it's easier to reason about, and picking a directory at random could result in different behavior between two no-op commands where one runs a GC and the other doesn't.
As to why directory 17 and not 42 or ff? Because Junio, the author of that patch and the current maintainer, chose 17 and it has some special meaning to him which he hasn't chosen to share. Even as a Git contributor, I don't know why he chose it (and I'm not aware of any other contributors besides Junio who know, either) and pressuring him about his reasons would be rude and insensitive, so we haven't done that.
Thanks to @JoachimSauer I got it.
They just pick one directory to estimate the size.
That is enough because the objects are spread evenly. If the directory doesn't exist, you can skip the check because the object store is already small.
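A hedged C sketch of that estimate: count the entries in the single "17" fan-out directory and multiply by 256. This mirrors the idea described above, not Git's actual implementation; estimate_loose_objects is an illustrative name.

#include <dirent.h>
#include <stdio.h>
#include <string.h>

static long estimate_loose_objects(const char *objdir)
{
    char path[4096];
    if ((size_t)snprintf(path, sizeof path, "%s/17", objdir) >= sizeof path)
        return -1;                        /* the "insanely long" case */

    DIR *dir = opendir(path);
    if (!dir)
        return 0;     /* no such directory: few objects, nothing to do */

    long count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        /* loose object names are 38 hex chars (SHA-1 minus the 2-char prefix) */
        if (strlen(ent->d_name) == 38)
            count++;
    }
    closedir(dir);
    return count * 256;  /* extrapolate: objects spread evenly over 00..ff */
}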

Reconstruct version control from set of files

I am looking for an approach to the following task:
given a set of files that are highly similar (I am using fuzzy hashing here), I would like to know if there is an algorithm that allows me to label those files with a version number. The output should return the sequential order in which those files were generated.
The reason is that I have to re-organize the data of a team who were not familiar with version control.
Thank you
A fairly simple approach (I hope) would be to try to convert this into some kind of graph problem.
Let's say every file is a node, with edges between every two files.
The weight of an edge between two nodes would be, for instance, the number of differing lines between the two files (or some other function).
What you do next is find a non-cyclic path that traverses all files with the minimum cost -- something like this, if you know the first file and the last.
You could add an empty file and the latest version you have as your start and end nodes.
I'm guessing this won't give you the exact result, but it'll probably give you a good starting point.
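As a very rough, untested illustration of that idea, with a made-up difference matrix and a greedy nearest-neighbor walk standing in for a proper minimum-cost path search:

#include <stdio.h>

#define NFILES 5   /* index 0 is the artificial empty file */

int main(void)
{
    /* dist[i][j]: differing lines between file i and file j (made-up data). */
    int dist[NFILES][NFILES] = {
        {  0, 120, 200, 310, 400 },
        {120,   0,  90, 210, 300 },
        {200,  90,   0, 130, 220 },
        {310, 210, 130,   0, 100 },
        {400, 300, 220, 100,   0 },
    };

    int visited[NFILES] = { 1 };   /* start at the empty file */
    int current = 0;

    printf("suggested version order:");
    for (int step = 1; step < NFILES; step++) {
        /* Always step to the nearest unvisited file. */
        int best = -1;
        for (int j = 0; j < NFILES; j++)
            if (!visited[j] && (best < 0 || dist[current][j] < dist[current][best]))
                best = j;
        visited[best] = 1;
        current = best;
        printf(" v%d=file%d", step, best);
    }
    printf("\n");
    return 0;
}

In practice the matrix would come from comparing each pair of files (e.g. counting differing lines), and the greedy order would only be a first guess to refine by hand.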
Hope this is helpful.

Strategy for mass storage of small files

What is a good strategy for mass storage of millions of small files (~50 KB on average) with auto-pruning of files older than 20 minutes? I need to write and access them from the web server.
I am currently using ext4, and during deletes (scheduled in cron) HDD usage spikes up to 100%, with [flush-8:0] showing up as the process that creates the load. This load interferes with other applications on the server. When there are no deletes, maximum HDD utilisation is 0-5%. The situation is the same with nested and non-nested directory structures. The worst part is that mass removal during peak load seems to be slower than the rate of insertion, so the number of files that need to be removed grows larger and larger.
I have tried changing schedulers (deadline, cfq, noop); it didn't help. I have also tried setting ionice on the removal script, but it didn't help either.
I have tried GridFS with MongoDB 2.4.3 and it performs nicely, but horribly during mass deletes of old files. I have tried running MongoDB with journaling turned off (nojournal) and without write confirmation for both deletes and inserts (w=0) and it didn't help. It only works fast and smoothly when there are no deletes going on.
I have also tried storing the data in MySQL 5.5, in a BLOB column, in an InnoDB table, with the InnoDB engine set to use innodb_buffer_pool=2GB, innodb_log_file_size=1GB, innodb_flush_log_on_trx_commit=2, but the performance was worse: HDD load was always at 80%-100% (expected, but I had to try). The table only had the BLOB column, a DATETIME column and a CHAR(32) latin1_bin UUID column, with indexes on the UUID and DATETIME columns, so there was no room for optimization, and all queries were using indexes.
I have looked into pdflush settings (the Linux flush process that creates the load during mass removal), but changing the values didn't help at all, so I reverted to the defaults.
It doesn't matter how often I run the auto-pruning script: every 1 second, every 1 minute, every 5 minutes, every 30 minutes, it disrupts the server significantly either way.
I have tried storing the inode value and, when removing, deleting old files sequentially by sorting them by their inode numbers first, but it didn't help.
Using CentOS 6. The drive is an SSD RAID 1.
What would be a good and sensible solution for my task that solves the auto-pruning performance problem?
Deletions are kind of a performance nuisance because both the data and the metadata need to get destroyed on disk.
Do they really need to be separate files? Do the old files really need to get deleted, or is it OK if they get overwritten?
If the answer is "no" to the second of these questions, try this:
Keep a list of files that's roughly sorted by age. Maybe chunk it by file size.
When you want to write a new file, find an old file that's preferably bigger than what you're replacing it with. Instead of blowing away the old file, truncate() it to the appropriate length and then overwrite its contents (a sketch follows below). Make sure you update your old-files list.
Clean up the really old stuff that hasn't been replaced explicitly once in a while.
It might be advantageous to have an index into these files. Try using a tmpfs full of symbolic links to the real file system.
You might or might not get a performance advantage in this scheme by chunking the files into manageably-sized subdirectories.
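A minimal, untested sketch of the truncate-and-overwrite reuse (reuse_file is an illustrative name; finding a suitable victim from the old-files list is left to the caller):

#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Reuse an expired file in place instead of unlinking and recreating it. */
static int reuse_file(const char *victim_path, const void *data, size_t len)
{
    int fd = open(victim_path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* Shrink the old file to the new length... */
    if (ftruncate(fd, (off_t)len) != 0 ||
        /* ...then overwrite it from the start with the new contents. */
        pwrite(fd, data, len, 0) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;   /* the caller then updates its old-files list */
}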
If you're OK with multiple things being in the same file:
Keep files of similar sizes together by storing each one as an offset into an array of similarly-sized files. If every file is 32k or 64k, keep a file full of 32k chunks and a file full of 64k chunks. If files are of arbitrary sizes, round up to the next power of two.
You can do lazy deletes here by keeping track of how stale each file is. If you're trying to write and something's stale, overwrite it instead of appending to the end of the file.
Another thought: do you get a performance advantage by truncate()ing all of the files to length 0 in inode order and then unlink()ing them? Ignorance stops me from knowing whether this can actually help, but it seems like it would keep the data zeroing together and the metadata writes grouped together as well.
Yet another thought: XFS has a weaker write ordering model than ext4 with data=ordered. Is it fast enough on XFS?
If mass removal of millions of files results in a performance problem, you can resolve it by "removing" all files at once. Instead of using any per-file filesystem operation (like "remove" or "truncate"), you can just create a new (empty) filesystem in place of the old one.
To implement this idea you need to split your drive into two (or more) partitions. After one partition is full (or after 20 minutes), you start writing to the second partition while using the first one for reading only. After another 20 minutes you unmount the first partition, create an empty filesystem on it, mount it again, and then start writing to the first partition while using the second one for reading only.
The simplest solution is to use just two partitions. But this way you don't use disk space very efficiently: you can store only half as many files on the same drive. With more partitions you can increase the space efficiency.
If for some reason you need all your files in one place, use tmpfs to store links to the files on each partition. This requires mass removal of millions of links from tmpfs, but it alleviates the performance problem because only the links need to be removed, not the file contents; also, these links only have to be removed from RAM, not from the SSD.
If you don't need to append to the small files, I would suggest that you create one big file and write the small files into it sequentially, while keeping records of the offsets and sizes of all the small files within that big file.
As you reach the end of the big file, start writing from its beginning again, invalidating the records of the small files at the beginning that you replace with new data.
If you choose the big file's size properly, based on the rate at which new files are saved, you can get auto-pruning of files older than ~20 minutes almost exactly as you need.
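A rough, untested sketch of that ring-buffer layout (the record table lives in memory here, and all names, sizes and the in-place invalidation scheme are illustrative choices, not part of the answer above):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define BIG_FILE_SIZE ((off_t)1 << 30)   /* sized for roughly 20 minutes of writes */
#define MAX_RECORDS   1000000

struct record {
    char   key[33];    /* e.g. the UUID from the question */
    off_t  offset;
    size_t size;
    int    valid;
};

static struct record records[MAX_RECORDS];
static size_t nrecords;
static off_t  write_pos;

/* Append one small file; wrap to the start when the big file is full. */
static int store_small_file(int bigfd, const char *key,
                            const void *data, size_t len)
{
    if (write_pos + (off_t)len > BIG_FILE_SIZE)
        write_pos = 0;                      /* wrap: old data gets overwritten */

    /* Invalidate any record that the new data will overwrite. */
    size_t used = nrecords < MAX_RECORDS ? nrecords : MAX_RECORDS;
    for (size_t i = 0; i < used; i++)
        if (records[i].valid &&
            records[i].offset < write_pos + (off_t)len &&
            records[i].offset + (off_t)records[i].size > write_pos)
            records[i].valid = 0;

    if (pwrite(bigfd, data, len, write_pos) != (ssize_t)len)
        return -1;

    struct record *r = &records[nrecords % MAX_RECORDS];
    snprintf(r->key, sizeof r->key, "%s", key);
    r->offset = write_pos;
    r->size   = len;
    r->valid  = 1;
    nrecords++;

    write_pos += (off_t)len;
    return 0;
}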

Uniquely identify files/folders in NTFS, even after move/rename

I haven't found a backup (synchronization) program which does what I want, so I'm thinking about writing my own.
What I have now does the following: it goes through the data in the source, and every file which has its archive bit set OR does not exist in the destination is copied to the destination, overwriting a possibly existing file. When done, it checks all files in the destination and deletes every one that does not exist in the source.
The problem is that if I move or rename a large folder, it first gets copied to the destination even though it is, in principle, already there, just under a different path. Then the folder which was already there is deleted afterwards.
Apart from the unnecessary copying, I frequently run into space problems because my backup drive isn't large enough to hold the original data twice.
Is there a way to programmatically identify such moved/renamed files or folders, e.g. by NTFS ID, physical location on the media, or something else? Are there existing solutions to this problem?
I do not care about the programming language, but hints for doing this with Python, C++, C#, Java or Prolog are appreciated.
Are you familiar with object IDs? This might be what you're looking for: http://msdn.microsoft.com/en-us/library/aa363997.aspx
You may also want to use file IDs. You can get these from the FileId field of FILE_ID_BOTH_DIR_INFO, which you get by calling GetFileInformationByHandleEx, or from the nFileIndexLow and nFileIndexHigh fields of BY_HANDLE_FILE_INFORMATION, which you get by calling GetFileInformationByHandle.
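A hedged sketch of the second option (the example path and the 64-bit packing of the two index fields are illustrative; on NTFS the volume serial number plus file index stays stable across renames and moves within the same volume, so it can be used to match a "new" path against a previously backed-up file):

#include <windows.h>
#include <stdio.h>

/* Prints the volume serial number and 64-bit file index for a path. */
static int print_file_id(const wchar_t *path)
{
    /* FILE_FLAG_BACKUP_SEMANTICS is required to open directories. */
    HANDLE h = CreateFileW(path, 0,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                           NULL, OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return -1;

    BY_HANDLE_FILE_INFORMATION info;
    if (!GetFileInformationByHandle(h, &info)) {
        CloseHandle(h);
        return -1;
    }
    unsigned long long id =
        ((unsigned long long)info.nFileIndexHigh << 32) | info.nFileIndexLow;
    wprintf(L"%ls: volume %08lX, file id %016llX\n",
            path, info.dwVolumeSerialNumber, id);
    CloseHandle(h);
    return 0;
}

int main(void)
{
    /* Example path; replace with files/folders from the backup set. */
    return print_file_id(L"C:\\Windows\\notepad.exe");
}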
Although it would require you to redesign your system, NTFS has a feature called a change journal that was designed for exactly this situation. It keeps track of every file that was changed, even across reboots. When your program runs, it would read the change journal from wherever it left off. For every file that was deleted, delete that file in your backup. For every file that was renamed, rename that file in your backup. For every file that was created or changed, copy that file to your backup. Now, instead of having to traverse both directory trees in parallel, you can simply traverse the list of files you'll actually have to pay attention to.
Not sure about the NTFS specifics which might help you, but didn't you think about comparing file hashes? And in order not to calculate the hash many times, you could first compare file sizes.
