how much time for opening a file? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
In my program, I'm using file.open(path_to_file);.
On the server side, I have a directory that contains plenty of files, and I'm afraid the program will take longer and longer to run as the directory grows, because of the file.open();
//code:
#include <fstream>

std::ofstream file;
file.open("/mnt/srv/links/154"); // 154 is the link id; the directory /mnt/srv/links contains plenty of files
// ... write to the file ...
file.close();
Question: can the time to execute file.open() vary with the number of files in the directory?
I'm using debian, and I believe my filesystem is ext3.

I'm going to try to answer this - however, it is rather difficult, as it would depend on, for example:
What filesystem is used - in some filesystems, a directory is an unsorted list of files, in which case the time to find a particular file is O(n) - so with 900000 files, that is a long list to search. On the other hand, some filesystems use a hash table or a sorted list, allowing O(1) and O(log2(n)) lookups respectively - of course, each component of the path has to be looked up individually. With 900k files, O(n) is 900000 times slower than O(1), and O(log2(n)) for 900k is just under 20, so roughly 45000 times "faster" than O(n). However, with 900k files, even a binary search may take some doing, because if each directory entry is around 100 bytes [1], we're talking about 85MB of directory data - several sectors to read in, even if we only touch 19 or 20 different places.
The location of the file itself - a file located on my own hard disk will be much quicker to get to than a file on my Austin, TX colleague's file server, when I'm in England.
The load of any file-server and comms links involved - naturally, if I'm the only one using a decent setup of a NFS or SAMBA server, it's going to be much quicker than using a file-server that is serving a cluster of 2000 machines that are all busy requesting files.
The amount of memory and overall memory usage on the system with the file, and/or the amount of memory available in the local machine. Most modern OSes will have a local file cache, and if you are using a server, a file cache on the server as well. More memory -> more space to cache things -> quicker access. In particular, the directory structure and contents may well be cached.
The overall performance of your local machine. Although nearly all of the above factors are important, the simple effort of searching files may well be enough to make some difference with a huge number of files - especially if the search is linear.
[1] A directory entry will have, at least:
A date/time for access, creation and update. With 64-bit timestamps, that's 24 bytes.
Filesize - at least 64-bits, so 8 bytes
Some sort of reference to where the file is - another 8 bytes at least.
A filename - variable length, but one can assume an average of 20 bytes.
Access control bits, at least 6 bytes.
That comes to 66 bytes. But I feel that 100 bytes is probably more typical.
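
If you want to check this on your own system, a rough micro-benchmark along these lines can measure the average open time (a sketch only; the path comes from the question and the sample count is arbitrary):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>

int main() {
    const std::string dir = "/mnt/srv/links/";  // directory under test, from the question
    const int samples = 1000;                   // arbitrary number of probe opens

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < samples; ++i) {
        std::ifstream f(dir + std::to_string(i));  // the open itself is what we time
    }
    auto end = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> elapsed = end - start;
    std::cout << "average open time: " << elapsed.count() / samples << " ms\n";
    return 0;
}

Running it once when the directory holds a few hundred files and again when it holds hundreds of thousands will show whether directory lookup is actually the bottleneck.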

Yes, it can. That depends entirely on the filesystem, not on the language. The times for opening/reading/writing/closing files are all dominated by the times of the corresponding syscalls. C++ should add relatively little overhead, even though you can get surprises from your C++ implementation.

There are a lot of variables which might affect the answer to this, but the general answer is that the number of files will influence the time taken to open a file.
The biggest variable is the filesystem used. Modern filesystems use directory index structures such as B-trees, so that looking up a known file is a relatively fast operation. On the other hand, listing all the files in the directory or searching for subsets using wildcards can take much longer.
Other factors include:
Whether symlinks need to be traversed to identify the file
Whether the file is local or mounted over a network
Caching
In my experience, using a modern filesystem, an individual file can be located in directories containing hundreds of thousands of files in well under a second.

Related

Strategy for mass storage of small files

What is a good strategy for mass storage of millions of small files (~50 KB on average) with auto-pruning of files older than 20 minutes? I need to write and access them from the web server.
I am currently using ext4, and during deletes (scheduled in cron) HDD usage spikes up to 100%, with [flush-8:0] showing up as the process that creates the load. This load interferes with other applications on the server. When there are no deletes, maximum HDD utilisation is 0-5%. The situation is the same with nested and non-nested directory structures. The worst part is that mass-removing during peak load seems to be slower than the rate of insertions, so the number of files that need to be removed grows larger and larger.
I have tried changing schedulers (deadline, cfq, noop); it didn't help. I have also tried setting ionice on the removal script, but it didn't help either.
I have tried GridFS with MongoDB 2.4.3 and it performs nicely, but horribly during mass deletes of old files. I have tried running MongoDB with journaling turned off (nojournal) and without write confirmation for both deletes and inserts (w=0), and it didn't help. It only works fast and smoothly when there are no deletes going on.
I have also tried storing the data in MySQL 5.5, in a BLOB column in an InnoDB table, with innodb_buffer_pool_size=2GB, innodb_log_file_size=1GB, innodb_flush_log_at_trx_commit=2, but the performance was worse; HDD load was always at 80%-100% (expected, but I had to try). The table only had the BLOB column, a DATETIME column and a CHAR(32) latin1_bin UUID column, with indexes on the UUID and DATETIME columns, so there was no room for optimization, and all queries were using indexes.
I have looked into pdflush settings (Linux flush process that creates the load during mass removal), but changing the values didn't help anything so I reverted to default.
It doesn't matter how often I run the auto-pruning script (every second, every minute, every 5 minutes, or every 30 minutes); it disrupts the server significantly either way.
I have tried to store inode value and when removing, remove old files sequentially by sorting them with their inode numbers first, but it didn't help.
Using CentOS 6. The drives are SSDs in RAID 1.
What would be a good and sensible solution for my task that solves the auto-pruning performance problem?
Deletions are kind of a performance nuisance because both the data and the metadata need to get destroyed on disk.
Do they really need to be separate files? Do the old files really need to get deleted, or is it OK if they get overwritten?
If the answer is "no" to the second of these questions, try this:
Keep a list of files that's roughly sorted by age. Maybe chunk it by file size.
When you want to write to a new file, find an old file that's preferably bigger than what you're replacing it with. Instead of blowing away the old file, truncate() it to the appropriate length and then overwrite its contents (a small sketch of this step appears after this answer). Make sure you update your old-files list.
Clean up the really old stuff that hasn't been replaced explicitly once in a while.
It might be advantageous to have an index into these files. Try using a tmpfs full of symbolic links to the real file system.
You might or might not get a performance advantage in this scheme by chunking the files into manageably-sized subdirectories.
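
As a rough illustration of the truncate-and-overwrite step sketched above (an assumption-laden example: reuse_file and its arguments are hypothetical names, and it relies on plain POSIX truncate()):

#include <fstream>
#include <string>
#include <sys/types.h>
#include <unistd.h>   // truncate()

// Reuse an old file instead of unlinking it and creating a new one:
// truncate it to the new length, then overwrite its contents in place.
// This keeps the existing directory entry and inode.
bool reuse_file(const std::string& old_path, const std::string& data) {
    // Step 1: cut (or extend) the file to exactly the new payload length.
    if (truncate(old_path.c_str(), static_cast<off_t>(data.size())) != 0)
        return false;
    // Step 2: overwrite the contents. in|out avoids the implicit
    // truncate-to-zero of a plain ofstream, so only the needed bytes are rewritten.
    std::fstream out(old_path, std::ios::binary | std::ios::in | std::ios::out);
    if (!out) return false;
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    return static_cast<bool>(out);
}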
If you're OK with multiple things being in the same file:
Keep files of similar sizes together by storing each one as an offset into an array of similarly-sized files. If every file is 32k or 64k, keep a file full of 32k chunks and a file full of 64k chunks. If files are of arbitrary sizes, round up to the next power of two.
You can do lazy deletes here by keeping track of how stale each file is. If you're trying to write and something's stale, overwrite it instead of appending to the end of the file.
Another thought: Do you get a performance advantage by truncate()ing all of the files to length 0 in inode order and then unlink()ing them? Ignorance stops me from knowing whether this can actually help, but it seems like it would keep the data zeroing together and the metadata writing similarly together.
Yet another thought: XFS has a weaker write ordering model than ext4 with data=ordered. Is it fast enough on XFS?
If mass-removing millions of files results in performance problem, you can resolve this problem by "removing" all files at once. Instead of using any filesystem operation (like "remove" or "truncate") you could just create a new (empty) filesystem in place of the old one.
To implement this idea you need to split your drive into two (or more) partitions. After one partition is full (or after 20 minutes) you start writing to second partition while using the first one for reading only. After another 20 minutes you unmount first partition, create empty filesystem on it, mount it again, then start writing to first partition while using the second one for reading only.
The simplest solution is to use just two partitions. But this way you don't use disk space very efficiently: you can store only half as many files on the same drive. With more partitions you can increase space efficiency.
If for some reason you need all your files in one place, use tmpfs to store links to files on each partition. This requires mass-removing millions of links from tmpfs, but this alleviates performance problem because only links should be removed, not file contents; also these links are to be removed only from RAM, not from SSD.
If you don't need to append to the small files, I would suggest that you create a big file and do a sequential write of the small files right in it, while keeping records of offsets and sizes of all the small files within that big file.
As you reach the end of the big file, start writing from its beginning again, while invalidating records of the small files in the beginning that you replace with new data.
If you choose the big file's size properly, based on the rate at which new files are saved, you can get auto-pruning of files older than ~20 minutes almost exactly as you need.
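
A minimal sketch of that circular big-file idea (the class, the in-memory index, and the fixed capacity are illustrative assumptions; the backing file is assumed to be pre-created at capacity bytes, and a real implementation would also invalidate index entries whose ranges a wrapped write overlaps):

#include <cstdint>
#include <fstream>
#include <string>
#include <unordered_map>

// Small "files" are appended into one big, pre-created file of `capacity` bytes;
// an in-memory map remembers where each record lives.
class BigFileStore {
public:
    struct Record { std::uint64_t offset; std::uint64_t size; };

    BigFileStore(const std::string& path, std::uint64_t capacity)
        : file_(path, std::ios::binary | std::ios::in | std::ios::out),
          capacity_(capacity) {}

    bool put(const std::string& key, const std::string& data) {
        if (!file_ || data.size() > capacity_) return false;
        if (write_pos_ + data.size() > capacity_)
            write_pos_ = 0;   // wrap around: the oldest records get overwritten
        file_.seekp(static_cast<std::streamoff>(write_pos_));
        file_.write(data.data(), static_cast<std::streamsize>(data.size()));
        index_[key] = {write_pos_, data.size()};
        write_pos_ += data.size();
        return static_cast<bool>(file_);
    }

    bool get(const std::string& key, std::string& out) {
        auto it = index_.find(key);
        if (it == index_.end() || !file_) return false;
        out.resize(it->second.size);
        file_.seekg(static_cast<std::streamoff>(it->second.offset));
        file_.read(&out[0], static_cast<std::streamsize>(it->second.size));
        return static_cast<bool>(file_);
    }

private:
    std::fstream file_;
    std::uint64_t capacity_;
    std::uint64_t write_pos_ = 0;
    std::unordered_map<std::string, Record> index_;
};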

Does a large number of directories negatively impact performance?

I'm intending to use files to store data as a kind of cache for PHP-generated files, so as to avoid having to re-generate them every time they are loaded (their contents only change once a day).
One thing I have noticed in the past is that if a directory has a large number of files inside, reaching the thousands, it will take a long time for an FTP program to load its contents, sometimes even crashing the computer that's trying to load them. So I'm looking into a tree-based system, where each file is stored in a subfolder based on its ID. So for example a file with the ID 123456 would be stored as 12/34/56.html. In this way, each folder will have at most 100 items (except in the event that there are millions of files, but that is extremely unlikely to happen).
Is this a good idea, is it overkill, or is it unnecessary? The question essentially boils down to: "What is the best way to organise a large number of files?"
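
The ID-to-path mapping being described is just fixed-width digit slicing. A minimal sketch (written in C++ for consistency with the rest of this page; cache_path is a hypothetical helper, and the 6-digit padding and .html suffix follow the question's own example):

#include <iomanip>
#include <sstream>
#include <string>

// Turn a numeric cache id into a nested path, e.g. 123456 -> "12/34/56.html".
// Ids are padded to 6 digits; larger ids simply get a longer leading segment.
std::string cache_path(unsigned long id) {
    std::ostringstream digits;
    digits << std::setw(6) << std::setfill('0') << id;
    const std::string s = digits.str();
    const std::string head = s.substr(0, s.size() - 4);
    return head + "/" + s.substr(s.size() - 4, 2) + "/" + s.substr(s.size() - 2) + ".html";
}

For example, cache_path(123456) yields "12/34/56.html", so each level holds at most 100 entries (barring ids beyond six digits).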
1) The answer depends on
OS (e.g. Linux vs Windows) and
filesystem (e.g. ext3 vs NTFS).
2) Keep in mind that when you arbitrarily create a new subdirectory, you're using more inodes
3) Linux usually handles "many files/directory" better than Windows
4) A couple of additional links (assuming you're on Linux):
200,000 images in single folder in linux, performance issue or not?
https://serverfault.com/questions/43133/filesystem-large-number-of-files-in-a-single-directory
https://serverfault.com/questions/147731/do-large-folder-sizes-slow-down-io-performance
http://tldp.org/LDP/intro-linux/html/sect_03_01.html

write one big file or multiple small files

I'm wondering what is better in terms of performance: writing to one big text file (something like 10GB or more) or using a subfolder system with 3 levels of 256 folders each, where the last level holds the text file. Example:
(The original post showed an example tree here: three nested levels of numbered subfolders, with the text file at the deepest level.)
It will be heavily accessed (a file will be opened, some text appended, then closed), so I don't know which is better: opening and closing file pointers thousands of times a second, or moving a pointer around inside one big file thousands of times.
I'm at a core i7, 6GB DDR3 of RAM and a 60MB/s write disk speed under ext4.
You ask a fairly generic question, so the generic answer would be to go with the big file, access it and let the filesystem and its caches worry about optimizing access. Chances are they came up with a more advanced algorithm than you just did (no offence).
To make a decision, you need to know answers to many questions, including:
How are you going to determine which of the many files to access for the information you are after?
When you need to append, will it be to the logical last file, or to the end of whatever file the information should have been found in?
How are you going to know where to look in any given file (large or small) for where the information is?
Your 256^3 files (16 million or so if you use all of them) will require a fair amount of directory storage.
You actually don't mention anything about reading the file - which is odd.
If you're really doing write-only access to the file or files, then a single file always opened with O_APPEND (or "a") will probably be best. If you are updating (as well as appending) information, then you get into locking issues (concurrent access; who wins).
So, you have not included enough information in the question for anyone to give any definitive answer. If there is enough more information in the comments you've added, then you should have placed those comments into the question (edit the question; add the comment material).
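
For the single-file, write-only case mentioned above, a minimal sketch (std::ios::app corresponds to opening with O_APPEND; append_record and the newline-delimited record format are assumptions, not part of the original answer):

#include <fstream>
#include <string>

// Append-only writes into one big file; with std::ios::app every write
// lands at the current end of the file, like O_APPEND.
void append_record(const std::string& path, const std::string& record) {
    std::ofstream out(path, std::ios::app);
    out << record << '\n';
}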

Fastest file access/storage?

I have about 750,000,000 files I need to store on disk. What's more is I need to be able to access these files randomly--any given file at any time--in the shortest time possible. What do I need to do to make accessing these files fastest?
Think of it like a hash table, only the hash keys are the filenames and the associated values are the files' data.
A coworker said to organize them into directories like this: if I want to store a file named "foobar.txt" and it's stored on the D: drive, put the file in "D:\f\o\o\b\a\r.\t\x\t". He couldn't explain why this was a good idea though. Is there anything to this idea?
Any ideas?
The crux of this is finding a file. What's the fastest way to find a file by name to open?
EDIT:
I have no control over the file system upon which this data is stored. It's going to be NTFS or FAT32.
Storing the file data in a database is not an option.
Files are going to be very small--maximum of probably 1 kb.
The drives are going to be solid state.
Data access is virtually random, but I could probably figure out a priority for each file based on how often it is requested. Some files will be accessed much more than others.
Items will constantly be added, and sometimes deleted.
It would be impractical to consolidate multiple files into single files because there's no logical association between files.
I would love to gather some metrics by running tests on this stuff, but that endeavour could become as consuming as the project itself!
EDIT2:
I want to upvote several thorough answers, whether they're spot-on or not, and cannot because of my newbie status. Sorry guys!
This sounds like it's going to be largely a question of filesystem choice. One option to look at might be ZFS, it's designed for high volume applications.
You may also want to consider using a relational database for this sort of thing. 750 million rows is sort of a medium size database, so any robust DBMS (eg. PostgreSQL) would be able to handle it well. You can store arbitrary blobs in the database too, so whatever you were going to store in the files on disk you can just store in the database itself.
Update: Your additional information is certainly helpful. Given a choice between FAT32 and NTFS, then definitely choose NTFS. Don't store too many files in a single directory, 100,000 might be an upper limit to consider (although you will have to experiment, there's no hard and fast rule). Your friend's suggestion of a new directory for every letter is probably too much, you might consider breaking it up on every four letters or something. The best value to choose depends on the shape of your dataset.
The reason breaking up the name is a good idea is that typically the performance of filesystems decreases as the number of files in a directory increases. This depends highly on the filesystem in use, for example FAT32 will be horrible with probably only a few thousand files per directory. You don't want to break up the filenames too much, so you will minimise the number of directory lookups the filesystem will have to do.
That file algorithm will work, but it's not optimal. I would think that using 2 or 3 character "segments" would be better for performance - especially when you start considering doing backups.
For example:
d:\storage\fo\ob\ar\foobar.txt
or
d:\storage\foo\bar\foobar.txt
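
A minimal sketch of the first variant above, assuming 2-character segments and a depth of 3 as in d:\storage\fo\ob\ar\foobar.txt (segmented_path is a hypothetical helper):

#include <string>

// Build a segmented path from a file name using fixed-width name segments,
// keeping the full name as the leaf so the file is easy to identify by hand.
std::string segmented_path(const std::string& root, const std::string& name) {
    const std::string stem = name.substr(0, name.find('.'));  // "foobar" from "foobar.txt"
    const std::size_t seg = 2, depth = 3;                     // both arbitrary choices
    std::string path = root;
    for (std::size_t i = 0; i < depth && i * seg < stem.size(); ++i)
        path += "\\" + stem.substr(i * seg, seg);
    return path + "\\" + name;
}

For example, segmented_path("d:\\storage", "foobar.txt") gives d:\storage\fo\ob\ar\foobar.txt.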
There are some benefits to using this sort of algorithm:
No database access is necessary.
Files will be spread out across many directories. If you don't spread them out, you'll hit severe performance problems. (I vaguely recall hearing about someone having issues at ~40,000 files in a single folder, but I'm not confident in that number.)
There's no need to search for a file. You can figure out exactly where a file will be from the file name.
Simplicity. You can very easily port this algorithm to just about any language.
There are some down-sides to this too:
Many directories may lead to slow backups. Imagine doing recursive diffs on these directories.
Scalability. What happens when you run out of disk space and need to add more storage?
Your file names cannot contain spaces.
This depends to a large extent on what file system you are going to store the files on. The capabilities of file systems in dealing with large number of files varies widely.
Your coworker is essentially suggesting the use of a Trie data structure. Using such a directory structure would mean that at each directory level there are only a handful of files/directories to choose from; this could help because as the number of files within a directory increases the time to access one of them does too (the actual time difference depends on the file system type.)
That said, I personally wouldn't go that many levels deep - three to four levels ought to be enough to give the performance benefits - most levels after that will probably have very few entries (assuming your file names don't follow any particular patterns.)
Also, I would store the file itself with its entire name; this will make it easier to traverse the directory structure manually, if required.
So, I would store foobar.txt as f/o/o/b/foobar.txt
This highly depends on many factors:
What file system are you using?
How large is each file?
What type of drives are you using?
What are the access patterns?
Accessing files purely at random is really expensive on traditional disks. One significant improvement you can get is to use a solid state drive.
If you can reason an access pattern, you might be able to leverage locality of reference to place these files.
Another possible way is to use a database system, and store these files in the database to leverage the system's caching mechanism.
Update:
Given your update, is it possible to consolidate some files? 1 KB files are not very efficient to store, because filesystems (FAT32, NTFS) have a cluster size and each file uses at least one cluster even if it is smaller than that. There is usually also a limit on the number of files in each folder, with performance concerns. You can do a simple benchmark by putting as many as 10k files in a folder to see how much performance degrades.
If you are set to use the trie structure, I would suggest survey the distribution of file names and then break them into different folders based on the distribution.
First of all, the file size is very small. Any filesystem will eat something like four times more space: a 1 KB file will occupy a 4 KB allocation on disk. Especially on SSDs, the 4 KB sector is the norm.
So you have to group several files into one physical file. 1024 files per storage file seems reasonable. To locate the individual files within these storage files you have to use some RDBMS (PostgreSQL was mentioned and it is good, but SQLite may be better suited to this) or a similar structure to do the mapping.
The directory structure suggested by your friend sounds good but it does not solve the physical storage problem. You may use similar directory structure to store the storage files. It is better to name them by using a numerical system.
If you can, do not format it as FAT32; use at least NTFS or some recent Unix-flavour filesystem. As the total size of the files is not that big, NTFS may be sufficient, but ZFS is the better option...
Is there any relation between individual files? As far as access times go, what folders you put things in won't affect much; the physical locations on the disk are what matter.
Why isn't storing the paths in a database table acceptable?
My guess is he is thinking of a Trie data structure to create on disk where the node is a directory.
I'd check out hadoops model.
I know this is a few years late, but maybe this can help the next guy..
My suggestion: use a SAN, mapped to a Z drive that other servers can map to as well. I wouldn't go with the folder path your friend suggested, but rather drive:\clientid\year\month\day\, and if you ingest more than 100k docs a day you can add subfolders for hour and even minute if needed. This way you never have more than 60 subfolders while going all the way down to seconds if required. Store the links in SQL for quick retrieval and reporting. This makes the folder path pretty short, for example Z:\05\2004\02\26\09\55\filename.txt, so you don't run into any 256-character path limitations across the board.
Hope that helps someone. :)

How can I predict the size of an ISO 9660 filesystem?

I'm archiving data to DVD, and I want to pack the DVDs full. I know the names and sizes of all the files I want on the DVD, but I don't know how much space is taken up by metadata. I want to get as many files as possible onto each DVD, so I'm using a Bubblesearch heuristic with greedy bin-packing. I try 10,000 alternatives and get the best one. Currently I know the sizes of all the files and because I don't know how files are stored in an ISO 9660 filesystem, I add a lot of slop for metadata. I'd like to cut down the slop.
I could use genisoimage -print-size, except it is too slow: given 40,000 files occupying 500MB, it takes about 3 seconds per call, and at 10,000 candidate packings per DVD that works out to roughly 8 hours per DVD, which is not in the cards. I've modified the genisoimage source before and am really not keen to try to squeeze the algorithm out of the source code; I am hoping someone knows a better way to get an estimate or can point me to a helpful specification.
Clarifying the problem and the question:
I need to burn archives that split across multiple DVDs, typically around five at a time. The problem I'm trying to solve is to decide which files to put on each DVD, so that each DVD (except the last) is as full as possible. This problem is NP-hard.
I'm using the standard greedy packing algorithm where you place the largest file first and you put it in the first DVD having sufficient room. So j_random_hacker, I am definitely not starting from random. I start from sorted and use Bubblesearch to perturb the order in which the files are packed. This procedure improves my packing from around 80% of estimated capacity to over 99.5% of estimated capacity. This question is about doing a better job of estimating the capacity; currently my estimated capacity is lower than real capacity.
I have written a program that tries 10,000 perturbations, each of which involves two steps:
Choose a set of files
Estimate how much space those files will take on DVD
Step 2 is the step I'm trying to improve. At present I am "erring on the side of caution" as Tyler D suggests. But I'd like to do better. I can't afford to use genisoimage -print-size because it's too slow. Similarly, I can't tar the files to disk, because not only is it too slow, but a tar file is not the same size as an ISO 9660 image. It's the size of the ISO 9660 image I need to predict. In principle this could be done with complete accuracy, but I don't know how to do it. That's the question.
Note: these files are on a machine with 3TB of hard-drive storage. In all cases the average size of the files is at least 10MB; sometimes it is significantly larger. So it is possible that genisoimage will be fast enough after all, but I doubt it - it appears to work by writing the ISO image to /dev/null, and I can't imagine that will be fast enough when the image size approaches 4.7GB. I don't have access to that machine right now, or when I posted the original question. When I do have access in the evening I will try to get better numbers for the question. But I don't think genisoimage is going to be a good solution, although it might be a good way to learn a model of the filesystem that tells me how it works. Knowing that the block size is 2KB is already helpful.
It may also be useful to know that files in the same directory are burned to the same DVD, which simplifies the search. I want to access the files directly, which rules out tar-before-burning. (Most files are audio or video, which means there's no point in trying to hit them with gzip.)
Thanks for the detailed update. I'm satisfied that your current bin-packing strategy is pretty efficient.
As to the question, "Exactly how much overhead does an ISO 9660 filesystem pack on for n files totalling b bytes?" there are only 2 possible answers:
Someone has already written an efficient tool for measuring exactly this. A quick Google search turned up nothing however which is discouraging. It's possible someone on SO will respond with a link to their homebuilt tool, but if you get no more responses for a few days then that's probably out too.
You need to read the readily available ISO 9660 specs and build such a tool yourself.
Actually, there is a third answer:
(3) You don't really care about using every last byte on each DVD. In that case, grab a small representative handful of files of different sizes (say 5), pad them till they are multiples of 2048 bytes, and put all 2^5 possible subsets through genisoimage -print-size. Then fit the equation nx + y = iso_size - total_input_size on that dataset, where n = number of files in a given run, to find x, which is the number of bytes of overhead per file, and y, which is the constant amount of overhead (the size of an ISO 9660 filesystem containing no files). Round x and y up and use that formula to estimate your ISO filesystem sizes for a given set of files. For safety, make sure you use the longest filenames that appear anywhere in your collection for the test filenames, and put each one under a separate directory hierarchy that is as deep as the deepest hierarchy in your collection.
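
A minimal sketch of that fitting step, assuming the (file count, overhead) pairs from the genisoimage -print-size runs have already been collected; the least-squares form here is a generalisation of fitting nx + y to the measurements, not anything from the original answer:

#include <cmath>
#include <vector>

struct Sample { double n; double overhead; };  // files in run, iso_size - total_input_size

// Fit overhead = n*x + y by least squares, then round both up for safety.
void fit_overhead(const std::vector<Sample>& s, double& x, double& y) {
    double sn = 0, so = 0, snn = 0, sno = 0;
    for (const Sample& p : s) {
        sn += p.n; so += p.overhead; snn += p.n * p.n; sno += p.n * p.overhead;
    }
    const double m = static_cast<double>(s.size());
    x = (m * sno - sn * so) / (m * snn - sn * sn);  // per-file overhead
    y = (so - x * sn) / m;                          // constant filesystem overhead
    x = std::ceil(x);
    y = std::ceil(y);
}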
I'm not sure exactly how you are currently doing this -- according to my googling, "Bubblesearch" refers to a way to choose an ordering of items that is in some sense near a greedy ordering, but in your case, the order of adding files to a DVD does not change the space requirements so this approach wastes time considering multiple different orders that amount to the same set of files.
In other words, if you are doing something like the following to generate a candidate file list:
Randomly shuffle the list of files.
Starting from the top of the list, greedily choose all files that you estimate will fit on a DVD until no more will.
Then you are searching the solution space inefficiently -- for any final candidate set of n files, you are potentially considering all n! ways of producing that set. My suggestion:
Sort all files in decreasing order of file size.
Mark the top (largest) file as "included," and remove it from the list. (It must be included on some DVD, so we might as well include it now.)
Can the topmost file in the list be included without the (estimated) ISO filesystem size exceeding the DVD capacity? If so:
With probability p (e.g. p = 0.5), mark the file as "included".
Remove the topmost file from the list.
If the list is now empty, you have a candidate list of files. Otherwise, goto 3.
Repeat this many times and choose the best file list.
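
A rough sketch of that procedure (estimate_iso_size stands in for whatever overhead model you end up with; candidate_list and the other names are hypothetical):

#include <cstdint>
#include <random>
#include <string>
#include <vector>

struct File { std::string name; std::uint64_t size; };

// One randomized-greedy pass over files already sorted by decreasing size.
// Each file is kept with probability p, and only if the estimated ISO size
// still fits on the DVD; the largest file is always kept when it fits.
std::vector<File> candidate_list(const std::vector<File>& files,  // largest first
                                 std::uint64_t dvd_capacity,
                                 double p,
                                 std::mt19937& rng,
                                 std::uint64_t (*estimate_iso_size)(const std::vector<File>&)) {
    std::vector<File> chosen;
    std::bernoulli_distribution include(p);
    bool first = true;
    for (const File& f : files) {
        chosen.push_back(f);
        const bool fits = estimate_iso_size(chosen) <= dvd_capacity;
        if (!fits || (!first && !include(rng)))
            chosen.pop_back();
        first = false;
    }
    return chosen;
}

Calling this many times with different random draws and keeping the fullest candidate implements the "repeat and choose the best" step.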
Tyler D's suggestion is also good: if you have ~40000 files totalling ~500MB, that means an average file size of 12.5KB. ISO 9660 uses a block size of 2KB, meaning those files are wasting on average 1KB of disk space, or about 8% of their size. So packing them together with tar first will save around 8% of space.
Can't use tar to store the files on disk?
It's unclear if you're writing a program to do this, or simply making some backups.
Maybe do some experimentation and err on the side of caution - some free space on a disk wouldn't hurt.
Somehow I imagine you've already considered these, or that my answer is missing the point.
I recently ran an experiment to find a formula to do a similar filling estimate on dvds, and found a simple formula given some assumptions... from your original post this formula will likely be a low number for you, it sounds like you have multiple directories and longer file names.
Assumptions:
all the files are exactly 8.3 characters.
all the files are in the root directory.
no extensions such as Joliet.
The formula:
174 + floor(count / 42) + sum( ceil(file_size / 2048) )
count is the number of files
file_size is each file's size in bytes
the result is in 2048 byte blocks.
An example script:
#!/usr/bin/perl -w
use strict;
use POSIX;
sub sum {
my $out = 0;
for (@_) {
$out += $_;
}
return $out;
}
my @sizes = ( 2048 ) x 1000;
my $file_count = @sizes;
my $data_size = sum(map { ceil($_ / 2048) } @sizes);
my $dir_size = floor( $file_count / 42 ) + 1;
my $overhead = 173;
my $size = $overhead + $dir_size + $data_size;
$\ = "\n";
print $size;
I verified this on disks with up to 150k files, with sizes ranging from 200 bytes to 1 MiB.
Nice thinking, J. Random. Of course I don't need every last byte; this is mostly for fun (and bragging rights at lunch). I want to be able to run du on the CD-ROM and have it come out very close to 4700000000.
I looked at the ECMA spec but like most specs it's medium painful and I have no confidence in my ability to get it right. Also it appears not to discuss Rock Ridge extensions, or if it does, I missed it.
I like your idea #3 and think I will carry it a bit further: I'll try to build a fairly rich model of what's going on and then use genisoimage -print-size on a number of filesets to estimate the parameters of the model. Then I can use the model to do my estimation. This is a hobby project so it will take a while, but I will get around to it eventually. I will post an answer here to say how much wastage is eliminated!
