I have a project where, at some point, I have a list of filenames (around 1,000). Then I have a loop where, for each iteration, I read two files per filename and do some calculations with them. The files are then closed (there is no writing anywhere) and the next iteration is performed. These two files always have similar sizes: one is about 400 bytes, and the other is between 50 and 100 KB.
This code will run in response to an API request in a web service, so there can be many requests in parallel (but, by the design of my code, exactly the same files will never be given to two parallel requests).
Right now I'm using an SSD to store the files. However, given the number of files to read per request, I think there could be IOPS problems. Is my concern founded? If so, do you think that in this scenario it is a good idea to store the 400-byte files as blobs? Would you do the same with the 100 KB images? Regarding the 100 KB files, I cannot read all of them at once, since that would mean holding 6 GB in memory, but I could make several calls to my database. Do you think that would be a good idea in that case? The total number of filenames available is about 150,000-200,000 (and therefore twice that number of files).
Finally, to obtain these 1,000 filenames I query one table in a database. Would you create a table for these files in this database? Would you put them in a different column of the table where the filenames are? Or maybe better in a different database? (I'm using PostgreSQL on AWS RDS.)
In my program, I'm using file.open(path_to_file). On the server side, I have a directory that contains plenty of files, and I'm afraid the program will take longer to run as the directory gets bigger and bigger, because of file.open().
// code:
ofstream file;
file.open("/mnt/srv/links/154"); // 154 is the link id; the directory /mnt/srv/links holds plenty of files
// write to file
file.close();
Question: can the time to execute file.open() vary according to the number of files in the directory?
I'm using debian, and I believe my filesystem is ext3.
I'm going to try to answer this - however, it is rather difficult, as it would depend on, for example:
What filesystem is used. In some filesystems a directory is an unsorted list of files, in which case the time to find a particular file is O(n) - so with 900,000 files it would be a long list to search. Others use a hash algorithm or a sorted structure, allowing O(1) and O(log2(n)) lookups respectively - of course, each component of a path has to be looked up individually. With 900k files, O(n) is 900,000 times slower than O(1), and since log2(900,000) is just under 20, an O(log2(n)) lookup is roughly 45,000 times "faster" than the linear scan. However, with 900k files even a binary search may take some doing, because if each directory entry is around 100 bytes [1], we're talking about 85 MB of directory data - so several sectors must be read in, even if we only touch 19 or 20 different places.
The location of the file itself - a file located on my own hard disk will be much quicker to get to than a file on my Austin, TX colleague's file server, when I'm in England.
The load of any file server and the comms links involved - naturally, if I'm the only one using a decent NFS or Samba server, it's going to be much quicker than using a file server that is serving a cluster of 2,000 machines all busy requesting files.
The amount of memory and overall memory usage on the system holding the file, and/or the amount of memory available on the local machine. Most modern OSes have a local file cache, and if you are using a server, a file cache on the server as well. More memory means more space to cache things, which means quicker access - in particular, the directory structure and contents may well be cached.
The overall performance of your local machine. Although nearly all of the above factors are important, the simple effort of searching files may well be enough to make some difference with a huge number of files - especially if the search is linear.
[1] A directory entry will have, at least:
A date/time for access, creation and update. With 64-bit timestamps, that's 24 bytes.
Filesize - at least 64-bits, so 8 bytes
Some sort of reference to where the file is - another 8 bytes at least.
A filename - variable length, but one can assume an average of 20 bytes.
Access control bits, at least 6 bytes.
That comes to 66 bytes. But I feel that 100 bytes is probably more typical.
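The lookup cost discussed above is easy to measure empirically. Below is a minimal C++17 sketch (the helper name and paths are mine, not from the question) that fills a directory with a given number of files and times a single open; comparing the result for a small count and a large count shows how much your filesystem's directory lookup degrades:

```cpp
#include <chrono>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Create `count` tiny files in `dir`, then time how long it takes to open
// one of them by name. Run once with a small count and once with a large
// count to see whether directory size affects lookup time on your FS.
double time_open_in_dir(const fs::path& dir, int count) {
    fs::create_directories(dir);
    for (int i = 0; i < count; ++i)
        std::ofstream(dir / ("f" + std::to_string(i))).put('x');

    auto start = std::chrono::steady_clock::now();
    std::ifstream probe(dir / ("f" + std::to_string(count / 2)));
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

Note that on a warm cache this measures the in-memory lookup path; to see raw disk behaviour you would need to drop caches between runs.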
Yes, it can. That depends entirely on the filesystem, not on the language. The times for opening/reading/writing/closing files are all dominated by the times of the corresponding syscalls. C++ should add relatively little overhead, even though you can get surprises from your C++ implementation.
There are a lot of variables which might affect the answer to this, but the general answer is that the number of files will influence the time taken to open a file.
The biggest variable is the filesystem used. Modern filesystems use directory index structures such as B-Trees, to allow searching for known files to be a relatively fast operation. On the other hand, listing all the files in the directory or searching for subsets using wildcards can take much longer.
Other factors include:
Whether symlinks need to be traversed to identify the file
Whether the file is local or mounted over a network
Caching
In my experience, on a modern filesystem an individual file can be located in a directory containing hundreds of thousands of files in well under a second.
What is the good strategy for mass storage for millions of small files (~50 KB on average) with auto-pruning of files older than 20 minutes? I need to write and access them from the web server.
I am currently using ext4, and during deletes (scheduled in cron) HDD usage spikes up to 100%, with [flush-8:0] showing up as the process that creates the load. This load interferes with other applications on the server. When there are no deletes, maximum HDD utilisation is 0-5%. The situation is the same with nested and non-nested directory structures. The worst part is that mass removal during peak load seems to be slower than the rate of insertion, so the number of files that need to be removed grows larger and larger.
I have tried changing schedulers (deadline, cfq, noop), it didn't help. I have also tried setting ionice to removing script, but it didn't help either.
I have tried GridFS with MongoDB 2.4.3 and it performs nicely, but horribly during mass deletes of old files. I have tried running MongoDB with journaling turned off (nojournal) and without write confirmation for both deletes and inserts (w=0), and it didn't help. It only works fast and smoothly when there are no deletes going on.
I have also tried storing the data in MySQL 5.5, in a BLOB column in an InnoDB table, with innodb_buffer_pool_size=2GB, innodb_log_file_size=1GB and innodb_flush_log_at_trx_commit=2, but the performance was worse: HDD load was always at 80%-100% (expected, but I had to try). The table used only a BLOB column, a DATETIME column and a CHAR(32) latin1_bin UUID, with indexes on the UUID and DATETIME columns, so there was no room for optimization, and all queries used the indexes.
I have looked into pdflush settings (Linux flush process that creates the load during mass removal), but changing the values didn't help anything so I reverted to default.
It doesn't matter how often I run the auto-pruning script - every second, every minute, every 5 minutes, every 30 minutes - it disrupts the server significantly either way.
I have tried to store inode value and when removing, remove old files sequentially by sorting them with their inode numbers first, but it didn't help.
Using CentOS 6. Storage is an SSD RAID 1 array.
What would be a good and sensible solution for my task that solves the auto-pruning performance problem?
Deletions are kind of a performance nuisance because both the data and the metadata need to get destroyed on disk.
Do they really need to be separate files? Do the old files really need to get deleted, or is it OK if they get overwritten?
If the answer is "no" to the second of these questions, try this:
Keep a list of files that's roughly sorted by age. Maybe chunk it by file size.
When you want to write to a new file, find an old file that's preferably bigger than what you're replacing it with. Instead of blowing away the old file, truncate() it to the appropriate length and then overwrite its contents. Make sure you update your old-files list.
Clean up the really old stuff that hasn't been replaced explicitly once in a while.
It might be advantageous to have an index into these files. Try using a tmpfs full of symbolic links to the real file system.
You might or might not get a performance advantage in this scheme by chunking the files in to manageably-sized subdirectories.
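The truncate-and-overwrite idea above can be sketched in a few lines of C++17 (the function name is mine, and a real implementation would also update the old-files list the answer mentions). Instead of unlinking the old file and creating a fresh one, the old file is shrunk to the new length and rewritten in place, so no inode is destroyed or allocated:

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Recycle an existing file: truncate it to exactly the new payload's
// size, then overwrite its contents from the start. This avoids the
// metadata churn of unlink() + create.
void recycle_file(const fs::path& old_file, const std::string& new_data) {
    fs::resize_file(old_file, new_data.size());  // truncate() equivalent
    // ios::in prevents the open itself from truncating to zero.
    std::ofstream out(old_file, std::ios::binary | std::ios::in | std::ios::out);
    out.write(new_data.data(), static_cast<std::streamsize>(new_data.size()));
}
```

Note that `fs::resize_file` requires the file to already exist, which matches the scheme: you only call this on files picked from the old-files list.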
If you're OK with multiple things being in the same file:
Keep files of similar sizes together by storing each one as an offset into an array of similarly-sized files. If every file is 32k or 64k, keep a file full of 32k chunks and a file full of 64k chunks. If files are of arbitrary sizes, round up to the next power of two.
You can do lazy deletes here by keeping track of how stale each file is. If you're trying to write and something's stale, overwrite it instead of appending to the end of the file.
Another thought: Do you get a performance advantage by truncate()ing all of the files to length 0 in inode order and then unlink()ing them? Ignorance stops me from knowing whether this can actually help, but it seems like it would keep the data zeroing together and the metadata writing similarly together.
Yet another thought: XFS has a weaker write ordering model than ext4 with data=ordered. Is it fast enough on XFS?
If mass-removing millions of files causes a performance problem, you can sidestep it by "removing" all the files at once: instead of using any per-file filesystem operation (like "remove" or "truncate"), just create a new (empty) filesystem in place of the old one.
To implement this idea you need to split your drive into two (or more) partitions. After one partition is full (or after 20 minutes) you start writing to the second partition while using the first one for reading only. After another 20 minutes you unmount the first partition, create an empty filesystem on it, mount it again, and then start writing to the first partition while using the second one for reading only.
The simplest solution is to use just two partitions, but that way you don't use disk space very efficiently: you can store only half as many files on the same drive. With more partitions you can increase space efficiency.
If for some reason you need all your files in one place, use tmpfs to store links to the files on each partition. This still requires mass-removing millions of links from tmpfs, but it alleviates the performance problem because only the links need to be removed, not the file contents; moreover, those links are removed only from RAM, not from the SSD.
If you don't need to append to the small files, I would suggest that you create a big file and do a sequential write of the small files right in it, while keeping records of offsets and sizes of all the small files within that big file.
As you reach the end of the big file, start writing from its beginning again, while invalidating records of the small files in the beginning that you replace with new data.
If you choose the size of the big file properly, based on the rate at which new files are saved, you can get auto-pruning of files older than ~20 minutes almost exactly as you need.
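The circular big-file scheme above can be sketched as follows (C++17; the class name is mine, and for brevity the {offset, size} index lives only in memory rather than in a persistent record). Small payloads are written sequentially into one preallocated file; when the write cursor reaches the end it wraps to the beginning, overwriting - and thereby implicitly pruning - the oldest entries:

```cpp
#include <cstdint>
#include <fstream>
#include <map>
#include <string>
#include <utility>

// One big file used as a ring buffer of small payloads. put() appends at
// the current cursor, wrapping to offset 0 when the capacity is reached;
// entries overwritten by the wrap are effectively auto-pruned.
class RingStore {
public:
    RingStore(const std::string& path, std::uint64_t capacity)
        : file_(path, std::ios::binary | std::ios::trunc | std::ios::in | std::ios::out),
          capacity_(capacity) {}

    void put(const std::string& key, const std::string& data) {
        if (cursor_ + data.size() > capacity_) cursor_ = 0;  // wrap around
        file_.seekp(static_cast<std::streamoff>(cursor_));
        file_.write(data.data(), static_cast<std::streamsize>(data.size()));
        index_[key] = {cursor_, data.size()};
        cursor_ += data.size();
    }

    std::string get(const std::string& key) {
        auto [off, len] = index_.at(key);
        std::string buf(len, '\0');
        file_.seekg(static_cast<std::streamoff>(off));
        file_.read(&buf[0], static_cast<std::streamsize>(len));
        return buf;
    }

private:
    std::fstream file_;
    std::uint64_t capacity_;
    std::uint64_t cursor_ = 0;
    std::map<std::string, std::pair<std::uint64_t, std::uint64_t>> index_;
};
```

A production version would also mark index entries stale when the cursor passes over them, as the answer describes.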
I'm intending to use files to store data as a kind of cache for PHP-generated files, so as to avoid having to re-generate them every time they are loaded (their contents only change once a day).
One thing I have noticed in the past is that if a directory has a large number of files inside, reaching the thousands, it will take a long time for an FTP program to load its contents, sometimes even crashing the computer that's trying to load them. So I'm looking into a tree-based system, where each file is stored in a subfolder based on its ID. So for example a file with the ID 123456 would be stored as 12/34/56.html. In this way, each folder will have at most 100 items (except in the event that there are millions of files, but that is extremely unlikely to happen).
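As a sketch of that scheme (C++; the function name is mine, and it assumes IDs below 1,000,000), the ID is zero-padded to six digits and split into three two-character levels:

```cpp
#include <cstdio>
#include <string>

// Map a numeric ID to its nested path, e.g. 123456 -> "12/34/56.html".
// Each directory level then holds at most 100 entries.
// Assumes id < 1,000,000 so the zero-padded form is exactly six digits.
std::string id_to_path(unsigned id) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%06u", id);
    std::string s(buf);
    return s.substr(0, 2) + "/" + s.substr(2, 2) + "/" + s.substr(4, 2) + ".html";
}
```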
Is this a good idea, is it overkill, or is it unnecessary? The question essentially boils down to: "What is the best way to organise a large number of files?"
1) The answer depends on
OS (e.g. Linux vs Windows) and
filesystem (e.g. ext3 vs NTFS).
2) Keep in mind that when you arbitrarily create a new subdirectory, you're using more inodes
3) Linux usually handles "many files/directory" better than Windows
4) A couple of additional links (assuming you're on Linux):
200,000 images in single folder in linux, performance issue or not?
https://serverfault.com/questions/43133/filesystem-large-number-of-files-in-a-single-directory
https://serverfault.com/questions/147731/do-large-folder-sizes-slow-down-io-performance
http://tldp.org/LDP/intro-linux/html/sect_03_01.html
I'm wondering what is better in terms of performance: writing to one big text file (about 10 GB or more), or using a subfolder system with 3 levels of 256 folders each, where the last level holds the text files. Example:
1/
  1/
  2/
  3/
    1
    2
    3
    4
  4/
2/
3/
4/
It will be heavily accessed (a file is opened, some text is appended, then it is closed), so I don't know which is better: opening and closing file pointers thousands of times per second, or moving a pointer around inside one big file thousands of times.
I'm on a Core i7 with 6 GB of DDR3 RAM and a 60 MB/s disk write speed, under ext4.
You ask a fairly generic question, so the generic answer would be to go with the big file, access it and let the filesystem and its caches worry about optimizing access. Chances are they came up with a more advanced algorithm than you just did (no offence).
To make a decision, you need to know answers to many questions, including:
How are you going to determine which of the many files to access for the information you are after?
When you need to append, will it be to the logical last file, or to the end of whatever file the information should have been found in?
How are you going to know where to look in any given file (large or small) for where the information is?
Your 256³ files (16 million or so if you use all of them) will require a fair amount of directory storage.
You actually don't mention anything about reading the file - which is odd.
If you're really doing write-only access to the file or files, then a single file always opened with O_APPEND (or "a") will probably be best. If you are updating (as well as appending) information, then you get into locking issues (concurrent access; who wins).
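For the write-only case, the append-mode suggestion looks like this in C++ (the helper name and path are illustrative); every write lands at the current end of the file, which keeps concurrent appenders from overwriting each other for writes that fit in a single underlying write() call:

```cpp
#include <fstream>
#include <string>

// Append one record to a single log-style file. std::ios::app is the
// equivalent of fopen's "a" / the O_APPEND flag: the stream seeks to
// end-of-file before each write.
void append_record(const std::string& path, const std::string& line) {
    std::ofstream out(path, std::ios::app);
    out << line << '\n';
}
```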
So, you have not included enough information in the question for anyone to give any definitive answer. If there is enough more information in the comments you've added, then you should have placed those comments into the question (edit the question; add the comment material).
We are currently using a DataSet to load and save our data to an XML file, and there is a good possibility that the XML file could get very large.
We are wondering whether there is any limit on the size of an XML file, so that the DataSet will not run into issues in the future because of its size. Please advise.
Well, the OS's maximum file size is one thing to consider (although modern OSes won't have this problem; older ones supported only 2 GB per file, if I recall correctly).
Also, the time you will need to waste updating such a file is enormous.
If you're going for a very, very large file, use a small DB instead (MySQL, SQL Server Express or SQLite).
DataSets are stored in memory, so the limit should be somewhere around the amount of memory the OS can address for user processes.
While doing work for a prior client that involved reading and parsing large XML files of over 2 GB each, the system choked when trying to use an XML reader. Working with Microsoft, we were ultimately passed on to the person who wrote the XML engine. His recommendation was to read and process the file in smaller chunks, since the reader couldn't load the entire thing into memory at once. However, if you are trying to WRITE XML as a stream to a final output .xml file, you should be good to go on most current OSes, which support files over 2 GB.