This is more of an algorithm question, but I am trying to figure out what would be most efficient for a large database of pictures. Would it make more sense to store a lot of files all under one directory (e.g. pictures/userid_pic_profile.png) or to use many directories with a few files each (e.g. userid/profile.png, userid/avatar.png)?
For organizational reasons alone you should be using multiple directories.
Additionally, for some operating systems having a very large number of files in one directory can cause real slowdowns when listing and searching for files (I am talking about thousands and tens of thousands of files in a single directory).
Related
I am currently in the process of designing a simple repository that uses the file system to store documents. There is the future potential for millions of files and the strategy that I want to use to map an ID to a location on disk is a means of hashing the ID and using part of the hash to determine the directory it should live in.
A common operation will be reading through all of the files in a folder and any of its nested folders.
My question is: is there an ideal ratio of files per directory? I have
the means to control this ratio via the ID -> location algorithm. Any
data to back answers up would be great.
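The ID -> location mapping described in the question could look roughly like the sketch below. The function name `shard_path`, the `repo` root, the choice of SHA-1 and the two levels of two hex digits are all assumptions made for illustration; none of them come from the question.

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(doc_id: str, root: str = "repo",
               levels: int = 2, width: int = 2) -> PurePosixPath:
    """Map an ID to root/<h1>/<h2>/<doc_id>, where h1, h2, ... are
    consecutive groups of hex digits taken from the SHA-1 of the ID."""
    digest = hashlib.sha1(doc_id.encode("utf-8")).hexdigest()
    # Take `levels` groups of `width` hex digits from the start of the hash.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(root, *parts, doc_id)
```

With these defaults there are 256 * 256 = 65,536 leaf directories and no directory holds more than 256 subdirectories, so the files-per-directory ratio is controlled by `levels` and `width`.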
If performance is what you're worrying about, this will depend on the
type of filesystem you are using. Older filesystems like
ext2 kept
directory entries in a linear list. Looking up a particular file in a
directory could be very expensive.
Modern filesystems such as ext4,
btrfs,
xfs
and others typically have indexed directories, so the access time of a
single file in a huge directory isn't going to be appreciably
different from accessing a single file in a small directory. In fact,
spreading millions of files over many subdirectories may give you
slower lookup performance than having them all in a single directory!
If you are writing your own software that will do a lot of linear
scans of the entire set of files or access individual files by name,
it probably doesn't matter which way you go about it (as long as you
access it the right
way).
I would worry more about managing the file system outside of the
application. Typical system utilities (like ls) may use readdir() or
linear scans of directories. To prevent the sysadmins from having
terrible headaches when diagnosing issues within the directory
structure, I'd go with something agreeably bushy, and 10k-20k entries
per directory (assuming indexed directories) would work.
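As a rough way to turn a target such as 10k-20k entries per directory into a layout, the sketch below computes how many leading hex digits of a hash are needed as a directory prefix. It assumes files spread evenly over the directories (reasonable for a good hash, but still an assumption), and the function name is made up for this example.

```python
def hex_prefix_length(total_files: int, target_per_dir: int = 10_000) -> int:
    """Smallest number of hex digits k such that 16**k directories keep the
    average directory at or below target_per_dir entries."""
    k = 0
    while (16 ** k) * target_per_dir < total_files:
        k += 1
    return k
```

For example, 10 million files at 10,000 per directory need 1,000 directories, which rounds up to three hex digits (4,096 directories).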
When choosing your layout, you may wish to watch out for limits on the
number of subdirectories allowed per directory (e.g. 64000 on ext4).
I have an app that keeps a database of files located on the user's machine or perhaps on networked volumes that may or may not be online. This database can potentially be several thousand files located in different folders. What is the best way to monitor them to receive notification when a file's name is changed, or it moves or is deleted?
I have used FSEvents before for a single directory but I am guessing that it does not scale well to a few thousand individual files. What about using kqueues?
I might be able to maintain a dynamic list of folders that encompasses all the files with as few folders as possible, but this means reading through the full list and trying to figure out common ancestors, etc.
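The "as few folders as possible" idea can be sketched as follows, assuming the watcher covers a directory and everything beneath it (as FSEvents does). The function name is hypothetical, and the sketch only drops folders that are already covered by another watched folder; it does not merge siblings into a wider common ancestor, since that means watching more unrelated content.

```python
from pathlib import PurePosixPath


def minimal_watch_dirs(file_paths):
    """Return the parent directories of the given files, dropping every
    directory that is nested inside another directory in the result."""
    parents = {PurePosixPath(p).parent for p in file_paths}
    roots = []
    # Visit shallow directories first so ancestors are accepted before descendants.
    for d in sorted(parents, key=lambda d: len(d.parts)):
        if not any(r in d.parents for r in roots):
            roots.append(d)
    return roots
```

For a few thousand files this quadratic check is cheap, and the list can be recomputed whenever the database changes.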
Thoughts or suggestions?
From Apple's docs:
If you are monitoring a large hierarchy of content, you should use
file system events instead, however, because kernel queues are
somewhat more complex than kernel events, and can be more resource
intensive because of the additional user-kernel communication
involved.
https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/FSEvents_ProgGuide/KernelQueues/KernelQueues.html#//apple_ref/doc/uid/TP40005289-CH5-SW2
I am looking to store many (500 million - 900 million) small (2kB-9kB) files.
Most general purpose filesystems seem unfit for this as they are either unable to handle the sheer number of files, slow down with many files or have exceedingly large block sizes.
This seems to be a common problem, however all solutions I could find seem to end up just accepting a hit to storage efficiency when storing small files on inodes roughly the same size as themselves.
thus
Are there any filesystems specifically designed to handle hundreds of millions of small files?
or
Is there a production level solution for archiving the small files on the fly and writing one large file to disk?
Our SolFS supports page sizes of as little as 512 bytes and lets you create a virtual file system in a file, thus combining all of your files into one storage file. Performance, though, depends on how files are stored (hierarchically or in one folder), and is in general specific to usage scenarios.
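The general idea of combining many small files into one storage file can be sketched as below. This is not SolFS's API; `PackWriter` and `read_blob` are names invented for illustration, and the index is kept in memory only (a real system would have to persist it and handle deletion and crash recovery).

```python
import io


class PackWriter:
    """Append small blobs to one large file-like object and remember where
    each one lives."""

    def __init__(self, stream):
        self.stream = stream  # any seekable binary stream opened for read/write
        self.index = {}       # name -> (offset, length)

    def add(self, name: str, data: bytes) -> None:
        # seek() returns the new absolute position, i.e. the current end.
        offset = self.stream.seek(0, io.SEEK_END)
        self.stream.write(data)
        self.index[name] = (offset, len(data))


def read_blob(stream, index, name: str) -> bytes:
    """Fetch one blob with a single seek and a single read."""
    offset, length = index[name]
    stream.seek(offset)
    return stream.read(length)
```

Because blobs are stored back to back, a 2 kB file costs about 2 kB plus its index entry, instead of being rounded up to a whole filesystem block.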
I'm working on a system that will need to store a lot of documents (PDFs, Word files, etc.). I'm using Solr/Lucene to search for relevant information extracted from those documents, but I also need a place to store the original files so that they can be opened/downloaded by the users.
I was thinking about several possibilities:
file system - probably not a good idea for storing 1M documents
SQL database - but I won't need most of its relational features, as I only need to store the binary document and its id, so this might not be the fastest solution
NoSQL database - I don't have any experience with them, so I'm not sure whether they are any good; there are also many of them, so I don't know which one to pick
The storage I'm looking for should be:
fast
scalable
open-source (not crucial but nice to have)
Can you recommend the best way of storing those files, in your opinion?
A filesystem -- as the name suggests -- is designed and optimised to store large numbers of files in an efficient and scalable way.
You can follow Facebook's example, as it stores a lot of files (15 billion photos):
They initially started with an NFS share served by commercial storage appliances.
Then they moved to their own implementation of an HTTP file server, called Haystack.
Here is a Facebook note if you want to learn more: http://www.facebook.com/note.php?note_id=76191543919
Regarding the NFS share: keep in mind that NFS shares usually limit the number of files in one folder for performance reasons. (This could be a bit counterintuitive if you assume that all recent file systems use B-trees to store their structure.) So if you are using commercial NFS shares (like NetApp), you will likely need to keep files in multiple folders.
You can do that if you have any kind of id for your files. Just divide its ASCII representation into groups of a few characters and make a folder for each group.
For example, we use integers for ids, so the file with id 1234567891 is stored as storage/0012/3456/7891.
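The id-splitting scheme above can be sketched like this. The padding width of 12 digits and the group size of 4 are inferred from the single example in the answer, so treat them (and the function name) as assumptions.

```python
def id_to_path(file_id: int, root: str = "storage",
               digits: int = 12, group: int = 4) -> str:
    """Zero-pad the id and split it into fixed-size groups; the last group
    is the file name, the earlier ones are nested folders."""
    if file_id < 0:
        raise ValueError("id must be non-negative")
    text = str(file_id).zfill(digits)
    if len(text) > digits:
        raise ValueError("id does not fit in the configured number of digits")
    chunks = [text[i:i + group] for i in range(0, digits, group)]
    return "/".join([root, *chunks])
```

With these settings `id_to_path(1234567891)` gives `storage/0012/3456/7891`, and no folder ever holds more than 10,000 entries.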
Hope that helps.
In my opinion...
I would store files compressed onto disk (file system) and use a database to keep track of them.
I would possibly use SQLite if that is the database's only job.
File system: looking at the big picture, a DBMS stores its data on the file system anyway. And the file system is dedicated to keeping files, so you benefit from its optimizations (as LukeH mentioned).
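A minimal sketch of that combination, using Python's standard `sqlite3` and `zlib` modules: the table layout, the `.z` suffix and the function names are assumptions for illustration, and `doc_id` is assumed to be safe to use as a file name.

```python
import sqlite3
import zlib
from pathlib import Path


def open_catalog(db_path):
    """Open (or create) the SQLite catalog that tracks stored documents."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id TEXT PRIMARY KEY, path TEXT NOT NULL, original_size INTEGER NOT NULL)"
    )
    return conn


def store_document(conn, storage_dir, doc_id, data: bytes) -> None:
    """Write the compressed bytes to disk, then record the location."""
    target = Path(storage_dir) / f"{doc_id}.z"
    target.write_bytes(zlib.compress(data))
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
            (doc_id, str(target), len(data)),
        )


def load_document(conn, doc_id) -> bytes:
    """Look up the path in the catalog and return the decompressed bytes."""
    row = conn.execute(
        "SELECT path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        raise KeyError(doc_id)
    return zlib.decompress(Path(row[0]).read_bytes())
```

Note that PDFs and modern Word files are already compressed internally, so the gain from `zlib` may be small for those formats.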
In terms of performance and efficiency, is it better to use lots of small files (by lots I mean as much as a few million) or a couple (ten or so) huge (several gigabyte) files? Let's just say I'm building a database (not entirely true, but all that matters is that it's going to be accessed a LOT).
I'm mainly concerned with read performance. My filesystem is currently ext3 on Linux (Ubuntu Server Edition if it matters), although I'm in a position where I can still switch, so comparisons between different filesystems would be fabulous. For technical reasons I can't use an actual DBMS for this (hence the question), so "just use MySQL" is not a good answer.
Thanks in advance, and let me know if I need to be more specific.
EDIT: I'm going to be storing lots of relatively small pieces of data, which is why using lots of small files would be easier for me. So if I went with using a few large files, I'd only be retrieving a few KB out of them at a time. I'd also be using an index, so that's not really a problem. Also, some of the data points to other pieces of data (it would point to the file in the lots-of-small-files case, and point to the data's location within the file in the large-files case).
There are a lot of assumptions here but, for all intents and purposes, searching through a large file will be much quicker than searching through a bunch of small files.
Let's say you are looking for a string of text contained in a text file. Searching a 1 TB file will be much faster than opening 1,000,000 files of 1 MB each and searching through those.
Each file-open operation takes time. A large file only has to be opened once.
And, in considering disk performance, a single file is much more likely to be stored contiguously than a large series of files.
...Again, these are generalizations without knowing more about your specific application.
It depends, really. Different filesystems are optimized in different ways, but in general, small files are packed efficiently. The advantage of having large files is that you don't have to open and close a lot of stuff; open and close are operations that take time. If you have a large file, you normally open and close it only once and use seek operations.
If you go for the lots-of-files solution, I suggest a structure like
b/a/bar
b/a/baz
f/o/foo
because you have limits on the number of files in a directory.
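The b/a/bar layout above can be produced with a one-line helper; the function name and the default of two levels are assumptions taken from the three example paths.

```python
def prefix_path(name: str, levels: int = 2) -> str:
    """Use the first `levels` characters of the name as nested folders."""
    if len(name) < levels:
        raise ValueError("name is shorter than the number of levels")
    return "/".join([*name[:levels], name])
```

Real names tend to cluster on a few leading letters, so the directories fill unevenly; hashing the name first spreads the files more evenly at the cost of readable paths.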
The main issue here, IMO, is indexing. If you're going to search for information in a huge file without a good index, you'll have to scan the whole file for the correct information, which can take a long time. If you think you can build strong indexing mechanisms, then fine, you should go with the huge file.
I'd prefer to delegate this task to ext3 which should be rather good at it.
edit :
One thing to consider, according to this Wikipedia article on ext3, is that fragmentation does happen over time. So if you have a huge number of small files that take up a significant percentage of the file system, you will lose performance over time.
The article also validates the claim about the 32k files per directory limit (assuming a Wikipedia article can validate anything).
I believe Ext3 has a limit of about 32000 files/subdirectories per directory. If you're going the millions of files route, you'll need to spread them throughout many directories. I don't know what that would do to performance.
My preference would be for the several large files. In fact, why have several at all, unless they're some kind of logically-separate units? If you're still splitting it up just for the sake of splitting it, I say don't do that. Ext3 can handle very large files just fine.
I work with a system that stores up to about 5 million files on an XFS file system under Linux and haven't had any performance problems. We only use the files for storing the data and never scan them in full; we have a database for searching, and one of the fields in a table contains a guid which we use for retrieval. We use exactly two levels of directories as above, with the filenames being the guid, though more could be used if the number of files got even larger. We chose this approach to avoid storing a few extra terabytes in the database that only needed to be stored/returned and never searched through, and it has worked well for us. Our files range from 1k to about 500k.
We have also run the system on ext3, and it functioned fine, though I'm not sure if we ever pushed it past about a million files. We'd probably need to go to a three-level directory system due to the limit on the maximum number of files per directory.