Ideal number of files per folder for large sets of files - filesystems

I am currently in the process of designing a simple repository that uses the file system to store documents. There is the potential for millions of files in the future, and the strategy I want to use to map an ID to a location on disk is to hash the ID and use part of the hash to determine the directory the file should live in.
A common operation will be reading through all of the files in a folder and any of its nested folders.
My question is: is there an ideal ratio of files per directory? I have
the means to control this ratio via the ID -> location algorithm. Any
data to back answers up would be great.

If performance is what you're worrying about, this will depend on the type of filesystem you are using. Older filesystems like ext2 kept directory entries in a linear list, so looking up a particular file in a directory could be very expensive.
Modern filesystems such as ext4, btrfs, xfs and others typically have indexed directories, so the access time of a single file in a huge directory isn't going to be appreciably different from accessing a single file in a small directory. In fact, spreading millions of files over many subdirectories may give you slower lookup performance than having them all in a single directory!
If you are writing your own software that will do a lot of linear scans of the entire set of files or access individual files by name, it probably doesn't matter which way you go about it (as long as you access it the right way).
I would worry more about managing the file system outside of the
application. Typical system utilities (like ls) may use readdir() or
linear scans of directories. To prevent the sysadmins from having
terrible headaches when diagnosing issues within the directory
structure, I'd go with something agreeably bushy, and 10k-20k entries
per directory (assuming indexed directories) would work.
When choosing your layout, you may wish to watch out for limits on the
number of subdirectories allowed per directory (i.e. 64000 on ext4).
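Since the question's ID -> location strategy is hash-based, here is a minimal sketch of what that mapping might look like, with the fan-out as the tunable knob; the constants, paths and function name below are assumptions for illustration, not something from the thread.

import hashlib
from pathlib import Path

BASE_DIR = Path("/srv/repository")   # hypothetical storage root
CHARS_PER_LEVEL = 2                  # 2 hex chars -> 256 subdirectories per level
LEVELS = 2                           # 256 * 256 = 65,536 leaf directories

def doc_path(doc_id):
    """Map a document ID to its on-disk location via a slice of its hash."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).hexdigest()
    parts = [digest[i * CHARS_PER_LEVEL:(i + 1) * CHARS_PER_LEVEL]
             for i in range(LEVELS)]
    return BASE_DIR.joinpath(*parts, doc_id)

# e.g. doc_path("invoice-42") might give /srv/repository/ab/cd/invoice-42,
# with the actual prefix depending on the digest.

Tuning the constants controls the ratio discussed above: a single 2-character level (256 directories) spreads 5 million files at roughly 20,000 per directory, at the top of the suggested range, while two levels drop it to well under a hundred per directory.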

Related

Storing a large amount of data

My question is: what is the best way to store a lot of files on a server? I did some searching, and what I know so far is that it's a bad idea to store all files in a single directory. I also know that some filesystems have a subdirectory limit, so it is not a good idea to create a new directory for every file. I also read about an approach of hashing the file and building the storage path from that string, but I think if I do this I will end up with a lot of subdirectories, which is maybe not a perfect solution.
There are tons of storage options available on the network to store a lot of data. The right solution often depends on specific needs, so if you are looking for a cheap and effective solution I would suggest using RAID (Redundant Array of Independent Disks).
1) RAID is a way of storing your data across multiple disks so that if something happens to one hard drive, none of your data will be lost. You can actually build your own server that uses RAID to protect your data files; RAID-5 is a common choice, ideally with a proper dedicated controller rather than an onboard one.
2) unRAID is no longer confined to the capabilities of a single OS. It lets you partition system resources, enabling you to store and protect data as well as run applications in isolated environments.
3) If you want to store a large number of files, don't let any single file grow beyond about 3-5 MB; the moment it does, create a new file with the next revision number so you can keep the chain going. The moment a folder crosses about 1 GB, create a new folder with the next revision number, and make sure the disk is NTFS-formatted and has enough space for your requirements.
Hope it helps.

Multiple Directories or One - data storage and access

This is more of an algorithm question, but I am trying to figure out what would be most efficient for a large database of pictures. Would it make more sense to store a lot of files all under one directory (e.g. pictures/userid_pic_profile.png) or to use multiple directories with a small number of files each (e.g. userid/profile.png, userid/avatar.png)?
For organizational reasons alone you should be using multiple directories.
Additionally, for some operating systems having a very large number of files in one directory can cause real slowdowns when listing and searching for files (I am talking about thousands and tens of thousands of files in a single directory).

How to efficiently store hundreds of thousands of documents?

I'm working on a system that will need to store a lot of documents (PDFs, Word files, etc.). I'm using Solr/Lucene to search for relevant information extracted from those documents, but I also need a place to store the original files so that they can be opened/downloaded by the users.
I was thinking about several possibilities:
file system - probably not that good an idea for storing 1m documents
sql database - but I won't need most of its relational features, as I need to store only the binary document and its id, so this might not be the fastest solution
no-sql database - I don't have any experience with them, so I'm not sure if they are any good either; there are also many of them, so I don't know which one to pick
The storage I'm looking for should be:
fast
scalable
open-source (not crucial but nice to have)
Can you recommend what you think the best way of storing those files would be?
A filesystem -- as the name suggests -- is designed and optimised to store large numbers of files in an efficient and scalable way.
You can follow Facebook's example, as it stores a lot of files (15 billion photos):
They initially started with NFS shares served by commercial storage appliances.
Then they moved to their own HTTP file server implementation called Haystack.
Here is a Facebook note if you want to learn more: http://www.facebook.com/note.php?note_id=76191543919
Regarding the NFS shares: keep in mind that NFS shares usually limit the number of files in one folder for performance reasons. (This can be a bit counterintuitive if you assume that all recent file systems use B-trees to store their structure.) So if you are using commercial NFS appliances (like NetApp) you will likely need to keep files in multiple folders.
You can do that if you have any kind of id for your files. Just divide its ASCII representation into groups of a few characters and make a folder for each group.
For example, we use integers for ids, so a file with id 1234567891 is stored as storage/0012/3456/7891.
Hope that helps.
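A small sketch of that integer-ID scheme, assuming ids are zero-padded to a fixed width and split into four-digit groups to match the example; the function and root names are made up here.

import os

def path_for_id(file_id, width=12, group=4, root="storage"):
    digits = str(file_id).zfill(width)               # 1234567891 -> "001234567891"
    groups = [digits[i:i + group] for i in range(0, width, group)]
    return os.path.join(root, *groups)               # last group doubles as the file name

print(path_for_id(1234567891))                       # storage/0012/3456/7891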
In my opinion...
I would store files compressed onto disk (file system) and use a database to keep track of them.
and possibly use SQLite if this is its only job.
File system: thinking about the big picture, the DBMS uses the file system anyway, and the file system is dedicated to keeping files, so you benefit from its optimizations (as LukeH mentioned).

Lots of small files or a couple huge ones?

In terms of performance and efficiency, is it better to use lots of small files (by lots I mean as much as a few million) or a couple (ten or so) huge (several gigabyte) files? Let's just say I'm building a database (not entirely true, but all that matters is that it's going to be accessed a LOT).
I'm mainly concerned with read performance. My filesystem is currently ext3 on Linux (Ubuntu Server Edition if it matters), although I'm in a position where I can still switch, so comparisons between different filesystems would be fabulous. For technical reasons I can't use an actual DBMS for this (hence the question), so "just use MySQL" is not a good answer.
Thanks in advance, and let me know if I need to be more specific.
EDIT: I'm going to be storing lots of relatively small pieces of data, which is why using lots of small files would be easier for me. So if I went with using a few large files, I'd only be retrieving a few KB out of them at a time. I'd also be using an index, so that's not really a problem. Also, some of the data points to other pieces of data (it would point to the file in the lots-of-small-files case, and point to the data's location within the file in the large-files case).
There are a lot of assumptions here but, for all intents and purposes, searching through a large file will be much quicker than searching through a bunch of small files.
Let's say you are looking for a string of text contained in a text file. Searching a single 1 TB file will be much faster than opening 1,000,000 files of 1 MB each and searching through those.
Each file-open operation takes time. A large file only has to be opened once.
And, in considering disk performance, a single file is much more likely to be stored contiguously than a large series of files.
...Again, these are generalizations without knowing more about your specific application.
It depends, really. Different filesystems are optimized in different ways, but in general, small files are packed efficiently. The advantage of having large files is that you don't have to open and close a lot of them; open and close are operations that take time. If you have a large file, you normally open and close it only once and use seek operations.
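To make the open-once-then-seek point concrete, here is a minimal sketch of the few-large-files approach, assuming an in-memory index that maps each record key to an (offset, length) pair; the class and names are invented for illustration, not something from the question.

class PackedStore:
    """Read records out of one large data file via an (offset, length) index."""
    def __init__(self, data_path, index):
        self._fh = open(data_path, "rb")   # opened once, reused for every read
        self._index = index                # key -> (offset, length)

    def read(self, key):
        offset, length = self._index[key]
        self._fh.seek(offset)              # jump to the record instead of opening another file
        return self._fh.read(length)

    def close(self):
        self._fh.close()

# usage: store = PackedStore("data.bin", {"rec42": (1024, 512)})
#        payload = store.read("rec42")

Each read costs a seek plus a read on an already-open handle, which is exactly the saving over an open/read/close of a separate small file.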
If you go for the lots-of-files solution, I suggest a structure like
b/a/bar
b/a/baz
f/o/foo
because you have limits on the number of files in a directory.
The main issue here IMO is indexing. If you're going to search for information in a huge file without a good index, you'll have to scan the whole file for the correct information, which can take a long time. If you think you can build strong indexing mechanisms then fine, you should go with the huge file.
I'd prefer to delegate this task to ext3 which should be rather good at it.
edit :
One thing to consider, according to this Wikipedia article on ext3, is that fragmentation does happen over time. So if you have a huge number of small files that take up a significant percentage of the file system, you will lose performance over time.
The article also validates the claim about the 32k files per directory limit (assuming a Wikipedia article can validate anything).
I believe Ext3 has a limit of about 32000 files/subdirectories per directory. If you're going the millions of files route, you'll need to spread them throughout many directories. I don't know what that would do to performance.
My preference would be for the several large files. In fact, why have several at all, unless they're some kind of logically-separate units? If you're still splitting it up just for the sake of splitting it, I say don't do that. Ext3 can handle very large files just fine.
I work with a system that stores up to about 5 million files on an XFS file system under Linux and haven't had any performance problems. We only use the files for storing the data; we never scan them in full. We have a database for searching, and one of the fields in a table contains a GUID which we use to retrieve the file. We use exactly two levels of directories as above, with the filenames being the GUID, though more levels could be used if the number of files grew even larger. We chose this approach to avoid storing a few extra terabytes in the database that only needed to be stored/returned and never searched through, and it has worked well for us. Our files range from 1k to about 500k.
We have also run the system on ext3, and it functioned fine, though I'm not sure if we ever pushed it past about a million files. We'd probably need to go to a three-level directory scheme due to maximum-files-per-directory limitations.

What is the best way to associate a file with a piece of data?

I have an application that creates records in a table (rocket science, I know). Users want to associate files (.doc, .xls, .pdf, etc...) to a single record in the table.
Should I store the contents of the file(s) in the database? Wouldn't this bloat the database?
Should I store the file(s) on a file server, and store the path(s) in the database?
What is the best way to do this?
I think you've accurately captured the two most popular approaches to solving this problem. There are pros and cons to each:
Store the Files in the DB
Most RDBMSs have support for storing blobs (binary file data: .doc, .xls, etc.) in a db, so you're not breaking new ground here.
Pros
Simplifies backup of the data: back up the db and you have all the files.
The linkage between the metadata (the other columns ABOUT the files) and the file itself is solid and built into the db, so it's a one-stop shop to get data about your files.
Cons
Backups can quickly blossom into a HUGE nightmare as you're storing all of that binary data with your database. You could alleviate some of the headaches by keeping the files in a separate DB.
Without the DB or an interface to the DB, there's no easy way to get to the file content to modify or update it.
In general, it's harder to code and coordinate the upload and storage of data to a DB vs. the filesystem.
Store the Files on the FileSystem
This approach is pretty simple: you store the files themselves in the filesystem, and your database stores a reference to the file's location (as well as all of the metadata about the file). One helpful hint here is to standardize your naming scheme for the files on disk (don't use the filename the user gives you; generate one of your own and store theirs in the db; a sketch follows this answer).
Pros
Keeps your file data cleanly separated from the database.
Easy to maintain the files themselves: if you need to swap out or update a file, you do so in the file system itself. You can just as easily do it from the application via a new upload.
Cons
If you're not careful, your database records about the files can get out of sync with the files themselves.
Security can be an issue (again, if you're careless) depending on where you store the files and whether or not that filesystem is available to the public (via the web, I'm assuming here).
At the end of the day, we chose to go the filesystem route. It was easier to implement quickly, easy on the backup, and pretty secure once we locked down any holes and streamed the file out (instead of serving it directly from the filesystem). It's been operational in pretty much the same format for about 6 years in two different government applications.
J
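As a rough sketch of the naming hint in the answer above: generate the on-disk name yourself, keep the user's original filename and the stored path in the database, and shard the storage directory by a prefix of the generated name. The table, column and directory names here are assumptions made up for the example, not part of the answer.

import os
import sqlite3
import uuid

STORAGE_ROOT = "/srv/uploads"   # hypothetical storage root

conn = sqlite3.connect("files.db")
conn.execute("""CREATE TABLE IF NOT EXISTS attachments (
                    id INTEGER PRIMARY KEY,
                    record_id INTEGER NOT NULL,
                    original_name TEXT NOT NULL,
                    stored_path TEXT NOT NULL UNIQUE)""")

def save_attachment(record_id, original_name, data):
    stored_name = uuid.uuid4().hex                     # server-chosen name, collision-free
    stored_path = os.path.join(STORAGE_ROOT, stored_name[:2], stored_name)
    os.makedirs(os.path.dirname(stored_path), exist_ok=True)
    with open(stored_path, "wb") as fh:
        fh.write(data)                                 # the file lives on disk...
    conn.execute("INSERT INTO attachments (record_id, original_name, stored_path) "
                 "VALUES (?, ?, ?)",
                 (record_id, original_name, stored_path))
    conn.commit()                                      # ...the db keeps name, path and linkage
    return stored_path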
How well you can store binaries, or BLOBs, in a database will be highly dependent on the DBMS you are using.
If you store binaries on the file system, you need to consider what happens in the case of a file name collision, where you try to store two different files with the same name, and whether or not that is a valid operation. So, along with the reference to where the file lives on the file system, you may also need to store the original file name.
Also, if you are storing a large amount of files, be aware of possible performance hits of storing all your files in one folder. (You didn't specify your operating system, but you might want to look at this question for NTFS, or this reference for ext3.)
We had a system that had to store several thousands of files on the file system, on a file system where we were concerned about the number of files in any one folder (it may have been FAT32, I think).
Our system would take a new file to be added, and generate an MD5 checksum for it (in hex). It would take the first two characters and make that the first folder, the next two characters and make that the second folder as a sub-folder of the first folder, and then the next two as the third folder as a sub-folder of the second folder.
That way, we ended up with a three-level set of folders, and the files were reasonably well scattered so no one folder filled up too much.
If we still had a file name collision after that, then we would just add "_n" to the file name (before the extension), where n was just an incrementing number until we got a name that didn't exist (and even then, I think we did atomic file creation, just to be sure).
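A minimal sketch of that scheme, assuming the first six hex characters of the MD5 are split two per folder level, with a simple existence check standing in for the "_n" collision handling described above; the function name is invented for the example.

import hashlib
import os

def sharded_path(root, filename, data):
    """Place a file under root/aa/bb/cc/ based on its MD5, avoiding name collisions."""
    digest = hashlib.md5(data).hexdigest()
    folder = os.path.join(root, digest[0:2], digest[2:4], digest[4:6])
    os.makedirs(folder, exist_ok=True)

    base, ext = os.path.splitext(filename)
    candidate = os.path.join(folder, filename)
    n = 1
    while os.path.exists(candidate):        # append "_n" until the name is free
        candidate = os.path.join(folder, f"{base}_{n}{ext}")
        n += 1
    return candidate

For the truly atomic creation the answer mentions, you would open the candidate with os.O_CREAT | os.O_EXCL rather than relying on the exists check alone.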
Of course, then you need tools to do the occasional comparison of the database records to the file system, flagging any missing files and cleaning up any orphaned ones where the database record no longer exists.
You should only store files in the database if you're reasonably sure you know that the sizes of those files aren't going to get out of hand.
I use our database to store small banner images, because I always know what size they're going to be. Your database will store a pointer to the data inside a row and then plunk the data itself somewhere else, so it doesn't necessarily impact speed.
If there are too many unknowns though, using the filesystem is the safer route.
Use the database for data and the filesystem for files. Simply store the file path in the database.
In addition, your webserver can probably serve files more efficiently than your application code will (which would otherwise have to stream the file from the DB back to the client).
Store the paths in the database. This keeps your database from bloating, and also allows you to separately back up the external files. You can also relocate them more easily; just move them to a new location and then UPDATE the database.
One additional thing to keep in mind: In order to use most of the filetypes you mentioned, you'll end up having to:
Query the database to get the file contents in a blob
Write the blob data to a disk file
Launch an application to open/edit/whatever the file you just created
Read the file back in from disk to a blob
Update the database with the new content
All that as opposed to:
Read the file path from the DB
Launch the app to open/edit/whatever the file
I prefer the second set of steps, myself.
The best solution would be to put the documents in the database. This simplifies all the linking, backup and restore issues, but it might not satisfy the basic 'we just want to point to documents on our file server' mindset the users may have.
It all depends (in the end) on actual user requirements.
But my recommendation would be to put it all together in the database so you retain control of the documents. Leaving them in the file system leaves them open to being deleted, moved, ACL'd or any one of hundreds of other changes that could render your links to them pointless or even damaging.
Database bloat is only an issue if you haven't sized for it. Do some tests and see what effects it has. 100GB of files on a disk is probably just as big as the same files in a database.
I would try to store it all in the database (I haven't done it myself). If you don't, there is a small risk that file names get out of sync with the files on disk, and then you have a big problem.
And now for the completely off-the-wall suggestion: you could consider storing the binaries as attachments in a CouchDB document database. This would avoid the file name collision issues, as you would use a generated UID as each document's ID (which is what you would store in your RDBMS), and the actual attachment's file name is kept with the document.
If you are building a web-based system, then the fact that CouchDB uses REST over HTTP could also be leveraged. And, there's also the replication facilities that could prove of use.
Of course, CouchDB is still in incubation, although there are some who are already using it 'in the wild'.
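For a sense of what that looks like in practice, here is a hedged sketch against CouchDB's HTTP API using the requests package; the database URL and document fields are placeholders, and the attachment upload needs the document's current revision.

import uuid
import requests

COUCH = "http://localhost:5984/files"        # hypothetical CouchDB database URL

def store_attachment(filename, data, content_type):
    doc_id = uuid.uuid4().hex                # generated UID, also stored in your RDBMS
    # create the document first; CouchDB returns its revision
    resp = requests.put(f"{COUCH}/{doc_id}", json={"filename": filename})
    resp.raise_for_status()
    rev = resp.json()["rev"]
    # attach the binary to that document under its original file name
    requests.put(f"{COUCH}/{doc_id}/{filename}",
                 params={"rev": rev},
                 data=data,
                 headers={"Content-Type": content_type}).raise_for_status()
    return doc_id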
I would store the files in the filesystem. But to keep the links to the files robust, i.e. to avoid the cons of this option, I would generate a hash for each file and then use the hash to retrieve it from the filesystem, without relying on the filenames and/or their paths.
I don't know the details, but I know that this can be done, because this is the way BibDesk works (a BibTeX app for Mac OS). It is wonderful software, used to keep track of the PDF attachments to the database of scientific literature that it manages.
