I am looking to store many (500 million - 900 million) small (2kB-9kB) files.
Most general-purpose filesystems seem unfit for this, as they are either unable to handle the sheer number of files, slow down with many files, or have exceedingly large block sizes.
This seems to be a common problem; however, all the solutions I could find seem to end up accepting a hit to storage efficiency by storing small files in inodes roughly the same size as the files themselves.
Thus:
Are there any filesystems specifically designed to handle hundreds of millions of small files?
Or is there a production-level solution for archiving the small files on the fly and writing one large file to disk?
Our SolFS supports page sizes as small as 512 bytes and lets you create a virtual file system inside a single file, thus combining all of your files into one storage file. Performance, though, depends on how the files are stored (hierarchically or in one folder) and is in general specific to the usage scenario.
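As a rough illustration of the general "many small files inside one storage file" idea (this is not the SolFS API; the file names and the SQLite index used here are illustrative choices):

    # Sketch: pack many small files into one container file, with a small
    # SQLite index mapping each name to its (offset, length) in the container.
    # Illustrative only; this is not how SolFS itself works internally.
    import os
    import sqlite3

    def pack(src_dir, container_path, index_path):
        db = sqlite3.connect(index_path)
        db.execute("CREATE TABLE IF NOT EXISTS idx "
                   "(name TEXT PRIMARY KEY, offset INTEGER, length INTEGER)")
        with open(container_path, "ab") as container:
            for root, _dirs, files in os.walk(src_dir):
                for name in files:
                    path = os.path.join(root, name)
                    with open(path, "rb") as f:
                        data = f.read()
                    offset = container.tell()      # append mode: current end of file
                    container.write(data)
                    db.execute("INSERT OR REPLACE INTO idx VALUES (?, ?, ?)",
                               (os.path.relpath(path, src_dir), offset, len(data)))
        db.commit()
        db.close()

    def read_one(container_path, index_path, name):
        db = sqlite3.connect(index_path)
        row = db.execute("SELECT offset, length FROM idx WHERE name = ?",
                         (name,)).fetchone()
        db.close()
        if row is None:
            raise KeyError(name)
        offset, length = row
        with open(container_path, "rb") as container:
            container.seek(offset)
            return container.read(length)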
I am currently in the process of designing a simple repository that uses the file system to store documents. There is the potential for millions of files in the future, and the strategy I want to use to map an ID to a location on disk is to hash the ID and use part of the hash to determine the directory the file should live in.
A common operation will be reading through all of the files in a folder and any of its nested folders.
My question is: is there an ideal ratio of files per directory? I have the means to control this ratio via the ID -> location algorithm. Any data to back answers up would be great.
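For concreteness, here is a minimal sketch of the kind of ID-to-path mapping I have in mind (the SHA-1 hash and the 256 x 256 two-level fan-out are just illustrative choices):

    # Sketch: hash the document ID and use the first hex characters of the
    # digest as two directory levels. The fan-out (256 x 256 = 65,536 buckets)
    # is an illustrative choice, not a recommendation.
    import hashlib
    import os

    def id_to_path(base_dir, doc_id):
        digest = hashlib.sha1(doc_id.encode("utf-8")).hexdigest()
        return os.path.join(base_dir, digest[0:2], digest[2:4], doc_id)

    # Usage: create the bucket directories lazily when storing a document.
    path = id_to_path("/srv/repo", "document-1234")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(b"...document bytes...")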
If performance is what you're worrying about, this will depend on the type of filesystem you are using. Older filesystems like ext2 kept directory entries in a linear list, so looking up a particular file in a large directory could be very expensive.
Modern filesystems such as ext4, btrfs, xfs and others typically have indexed directories, so the access time of a single file in a huge directory isn't going to be appreciably different from accessing a single file in a small directory. In fact, spreading millions of files over many subdirectories may give you slower lookup performance than having them all in a single directory!
If you are writing your own software that will do a lot of linear scans of the entire set of files or access individual files by name, it probably doesn't matter which way you go about it (as long as you access the files the right way).
I would worry more about managing the file system outside of the
application. Typical system utilities (like ls) may use readdir() or
linear scans of directories. To prevent the sysadmins from having
terrible headaches when diagnosing issues within the directory
structure, I'd go with something agreeably bushy, and 10k-20k entries
per directory (assuming indexed directories) would work.
When choosing your layout, you may wish to watch out for limits on the
number of subdirectories allowed per directory (i.e. 64000 on ext4).
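As a rough sizing check against those numbers (the 20-million-file total here is hypothetical):

    # Rough sizing check: how many leaf directories are needed at roughly
    # 15,000 files per directory, and does one level of fan-out stay under
    # ext4's 64,000-subdirectories-per-directory limit?
    n_files = 20_000_000              # hypothetical total file count
    per_dir = 15_000                  # middle of the 10k-20k suggestion above
    n_dirs = -(-n_files // per_dir)   # ceiling division
    print(n_dirs)                     # 1334 leaf directories
    print(n_dirs <= 64_000)           # True: a single level of fan-out is enough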
I want to make a program to keep records of students in a university and provide a search method. Which of the following methods will be faster?
Make one file for each student, or
Make a single data file and search within it?
The size of the student data will differ from student to student.
It is operating-system and file-system specific. Some general hints follow (with an implicit focus on Linux systems and good-enough file systems like Ext4, BTRFS, etc.).
Consider using a database, perhaps just SQLite, or a DBMS like PostgreSQL or MongoDB (indexing is essential for performance); a minimal SQLite sketch appears after these hints.
Your question depends upon the data size. If you are sure it is small enough to easily fit in RAM (e.g. less than a hundred megabytes on a recent laptop or desktop), you could serialize and deserialize all the data using some textual format like JSON. By contrast, if you are sure that you have data-center-sized data (several petabytes), things are very different.
In general, avoid having many tiny files, e.g. dozens of thousands of kilobyte-sized files. Prefer having fewer but bigger files (but if possible avoid huge file sizes, e.g. terabytes or larger than half of your biggest disk or partition; see however LVM).
Perhaps an indexed file library like GDBM would be worthwhile.
If you need to have a lot of files, put them in subdirectories: dir01/subdir02/file0345.txt is better than file01020345.txt; avoid having large directories with, e.g., more than a thousand files. (See the figure on the Wikipedia page about Ext2 to understand why.)
You might take a mixed approach: small contents (e.g. less than a megabyte) in some database, large contents in files (with some metadata in a database). Read also about binary large objects (BLOBs).
Read also about application checkpointing and persistence.
Define, implement, and test some backup and restore (human) procedures. For a database that is not huge, dump it in a textual format (e.g. SQL).
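As a minimal illustration of the database hint above (the table and column names are invented for this example):

    # Minimal sketch of the "use a database" hint: one SQLite table of
    # students with an index on the searched column. Table and column names
    # are invented for illustration.
    import sqlite3

    db = sqlite3.connect("students.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS students (
            student_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            department TEXT,
            record     TEXT      -- variable-sized per-student data, e.g. JSON
        )
    """)
    db.execute("CREATE INDEX IF NOT EXISTS idx_students_name ON students(name)")

    db.execute("INSERT OR REPLACE INTO students VALUES (?, ?, ?, ?)",
               (1234, "Ada Lovelace", "Mathematics", '{"scores": [15, 18]}'))
    db.commit()

    # Searching by name uses the index instead of scanning files.
    for row in db.execute("SELECT student_id, name FROM students WHERE name = ?",
                          ("Ada Lovelace",)):
        print(row)
    db.close()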
So your work should start with an estimate of the size of the data, and of how (and how often) it is accessed (and changed).
If it is homework and you are not allowed to use external libraries, you should organize a file into fixed-size records (accessed randomly, e.g. with fseek(3) or lseek(2)), probably coded as some tagged union, and take care of indexing (using e.g. hash-table or B-tree techniques). You might need to manage linked lists of several low-level records to handle large data.
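A rough sketch of the fixed-size-record idea, shown in Python for consistency with the other examples (seek() here plays the role of fseek(3)/lseek(2); the record layout is invented):

    # Sketch: a file of fixed-size records, accessed randomly by record number.
    # The 40-byte record layout (id, name, score) is an invented example.
    import struct

    RECORD = struct.Struct("<i32si")   # 4-byte id, 32-byte name, 4-byte score
    REC_SIZE = RECORD.size             # 40 bytes per record

    def write_record(f, index, student_id, name, score):
        f.seek(index * REC_SIZE)                       # like fseek/lseek
        f.write(RECORD.pack(student_id, name.encode("utf-8"), score))

    def read_record(f, index):
        f.seek(index * REC_SIZE)
        student_id, raw_name, score = RECORD.unpack(f.read(REC_SIZE))
        return student_id, raw_name.rstrip(b"\0").decode("utf-8"), score

    with open("students.dat", "w+b") as f:
        write_record(f, 0, 1234, "Ada Lovelace", 18)
        print(read_record(f, 0))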
Studying the implementation of SQLite or of GDBM (both are free software; you can download and study their source code) will be inspirational.
Notice that most universities have only a few tens of thousands of students, and I guess that each student would need a few (or a dozen) kilobytes for identity, scores and course information (unless you want to store photos or videos of every student!). So in practice you probably need only several dozen megabytes (maybe a couple of gigabytes), and that fits in RAM today.
I currently have 3 file servers, each with a RAID 6 array of 24 disks.
The question is: is there any way to make them work as one big drive rather than 3 separate systems? I need more throughput and I was thinking this might be a possibility. Maybe a distributed filesystem like Hadoop?
The answer depends on the intended usage of the data on this hardware.
The Hadoop file system, HDFS, is suited to the very specific needs of MapReduce processing. Its main limitations, which are fine for its intended use but problematic for other uses, are:
a) Files cannot be edited, only appended to.
b) Storing many small files is a problem. It is designed for files of 64 MB and more; the cause of this limitation is that all metadata is stored in memory.
c) It is not a POSIX-compliant FS, so you cannot mount it and use it as a regular file system from applications unaware of HDFS.
I would consider options like GlusterFS, Ceph or Lustre, which are built for cases similar to the one you describe. More information is needed to give good advice on selecting one of them.
I want to crawl a website and store the content on my computer for later analysis. However, my OS's file system has a limit on the number of subdirectories, meaning that storing the original folder structure is not going to work.
Suggestions?
Map the URL to some filename so I can store everything flatly? Or just shove it in a database like SQLite to avoid file-system limitations?
It all depends on the effective amount of text and/or web pages you intend to crawl. A generic solution is probably to:
use an RDBMS (SQL server of sorts) to store the meta-data associated with the pages.
Such info would be stored in a simple table (maybe with a very few support/related tables) containing fields such as Url, FileName (where you'll be saving it), Offset in File where stored (the idea is to keep several pages in the same file), date of crawl, size, and a few other fields.
use a flat file storage for the text proper.
The file name and path matter little (i.e. the path may be shallow and the name cryptic/automatically generated). This name/path is stored in the meta-data. Several crawled pages are stored in the same flat file, to limit the overhead in the OS of managing too many files. The text itself may be compressed (ZIP etc.) on a per-page basis (there's little extra compression gain to be had by compressing bigger chunks), allowing per-page handling (no need to decompress all the text before it!). The decision to use compression depends on various factors; the compression/decompression overhead is typically relatively minimal, CPU-wise, and offers a nice saving on HD space and generally on disk I/O performance.
The advantage of this approach is that the DBMS remains small, but is available for SQL-driven queries (of an ad-hoc or programmed nature) to search on various criteria. There is typically little gain (and a lot of headache) associated with storing many/big files within the SQL server itself. Furthermore as each page gets processed / analyzed, additional meta-data (such as say title, language, most repeated 5 words, whatever) can be added to the database.
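A minimal sketch of the meta-data-plus-flat-file scheme described above, assuming SQLite for the meta-data and per-page zlib compression (file names, the table layout and the compression choice are all illustrative):

    # Sketch: append each crawled page (compressed individually) to a shared
    # flat file, and record URL, file name, offset and length in a SQLite
    # meta-data table. Names and layout are illustrative.
    import sqlite3
    import time
    import zlib

    db = sqlite3.connect("crawl_meta.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            url        TEXT PRIMARY KEY,
            file_name  TEXT,
            offset     INTEGER,
            length     INTEGER,
            crawled_at REAL
        )
    """)

    def store_page(url, html, store_file="pages_0001.bin"):
        blob = zlib.compress(html.encode("utf-8"))
        with open(store_file, "ab") as f:
            offset = f.tell()          # append mode: position is the end of file
            f.write(blob)
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?)",
                   (url, store_file, offset, len(blob), time.time()))
        db.commit()

    def load_page(url):
        row = db.execute("SELECT file_name, offset, length FROM pages WHERE url = ?",
                         (url,)).fetchone()
        file_name, offset, length = row
        with open(file_name, "rb") as f:
            f.seek(offset)
            return zlib.decompress(f.read(length)).decode("utf-8")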
Having it in a database will help you search through the content and page metadata. You can also try in-memory databases or "memcached"-like storage to speed it up.
Depending on the processing power of the PC that will do the data mining, you could add the scraped data to a compressed archive like a 7zip, zip, or tarball. You'll be able to keep the directory structure intact and may end up saving a great deal of disk space, if that happens to be a concern.
On the other hand, an RDBMS like SQLite will balloon out really fast but won't mind ridiculously long directory hierarchies.
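A small sketch of the archive approach, assuming Python's tarfile module with gzip compression (the URL-to-path mapping and file names are illustrative):

    # Sketch: write crawled pages into a single gzip-compressed tar archive,
    # preserving a directory-like structure derived from each URL.
    import io
    import tarfile
    from urllib.parse import urlparse

    def url_to_member_name(url):
        parts = urlparse(url)
        path = parts.path.strip("/") or "index.html"
        return f"{parts.netloc}/{path}"

    pages = {"https://example.com/docs/intro.html": b"<html>...</html>"}

    with tarfile.open("crawl.tar.gz", "w:gz") as archive:
        for url, body in pages.items():
            info = tarfile.TarInfo(name=url_to_member_name(url))
            info.size = len(body)
            archive.addfile(info, io.BytesIO(body))

    # Individual members can later be read without unpacking everything:
    with tarfile.open("crawl.tar.gz", "r:gz") as archive:
        data = archive.extractfile("example.com/docs/intro.html").read()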
In terms of performance and efficiency, is it better to use lots of small files (by lots I mean as many as a few million) or a couple (ten or so) of huge (several-gigabyte) files? Let's just say I'm building a database (not entirely true, but all that matters is that it's going to be accessed a LOT).
I'm mainly concerned with read performance. My filesystem is currently ext3 on Linux (Ubuntu Server Edition if it matters), although I'm in a position where I can still switch, so comparisons between different filesystems would be fabulous. For technical reasons I can't use an actual DBMS for this (hence the question), so "just use MySQL" is not a good answer.
Thanks in advance, and let me know if I need to be more specific.
EDIT: I'm going to be storing lots of relatively small pieces of data, which is why using lots of small files would be easier for me. So if I went with using a few large files, I'd only be retrieving a few KB out of them at a time. I'd also be using an index, so that's not really a problem. Also, some of the data points to other pieces of data (it would point to the file in the lots-of-small-files case, and point to the data's location within the file in the large-files case).
There are a lot of assumptions here but, for all intents and purposes, searching through a large file will be much quicker than searching through a bunch of small files.
Let's say you are looking for a string of text contained in a text file. Searching a single 1 TB file will be much faster than opening 1,000,000 files of 1 MB each and searching through those.
Each file-open operation takes time. A large file only has to be opened once.
And, in considering disk performance, a single file is much more likely to be stored contiguously than a large series of files.
...Again, these are generalizations without knowing more about your specific application.
It depends, really. Different filesystems are optimized in different ways, but in general, small files are packed efficiently. The advantage of having large files is that you don't have to open and close a lot of stuff; open and close are operations that take time. If you have a large file, you normally open and close it only once and use seek operations instead.
If you go for the lots-of-files solution, I suggest a structure like
b/a/bar
b/a/baz
f/o/foo
because you have limits on the number of files in a directory.
The main issue here IMO is about indexing. If you're going to search for information in a huge file without a good index, you'll have to scan the whole file for the correct information, which can take a long time. If you think you can build strong indexing mechanisms, then fine, you should go with the huge file.
I'd prefer to delegate this task to ext3, which should be rather good at it.
Edit:
A thing to consider, according to this Wikipedia article on ext3, is that fragmentation does happen over time. So if you have a huge number of small files which take up a significant percentage of the file system, you will lose performance over time.
The article also validates the claim about the 32k-files-per-directory limit (assuming a Wikipedia article can validate anything).
I believe Ext3 has a limit of about 32000 files/subdirectories per directory. If you're going the millions of files route, you'll need to spread them throughout many directories. I don't know what that would do to performance.
My preference would be for the several large files. In fact, why have several at all, unless they're some kind of logically-separate units? If you're still splitting it up just for the sake of splitting it, I say don't do that. Ext3 can handle very large files just fine.
I work with a system that stores up to about 5 million files on an XFS file system under Linux and haven't had any performance problems. We only use the files for storing the data and never do full scans of them; we have a database for searching, and one of the fields in a table contains a GUID which we use to retrieve the file. We use exactly two levels of directories as described above, with the filenames being the GUID, though more levels could be used if the number of files grew even larger. We chose this approach to avoid storing a few extra terabytes in the database that only needed to be stored/returned and never searched through, and it has worked well for us. Our files range from 1 kB to about 500 kB.
We have also run the system on ext3, and it functioned fine, though I'm not sure if we ever pushed it past about a million files. We'd probably need to go to a three-level directory scheme due to maximum-files-per-directory limitations.