In terms of performance and efficiency, is it better to use lots of small files (by lots I mean as many as a few million) or a few (ten or so) huge (several-gigabyte) files? Let's just say I'm building a database (not entirely true, but all that matters is that it's going to be accessed a LOT).
I'm mainly concerned with read performance. My filesystem is currently ext3 on Linux (Ubuntu Server Edition if it matters), although I'm in a position where I can still switch, so comparisons between different filesystems would be fabulous. For technical reasons I can't use an actual DBMS for this (hence the question), so "just use MySQL" is not a good answer.
Thanks in advance, and let me know if I need to be more specific.
EDIT: I'm going to be storing lots of relatively small pieces of data, which is why using lots of small files would be easier for me. So if I went with using a few large files, I'd only be retrieving a few KB out of them at a time. I'd also be using an index, so that's not really a problem. Also, some of the data points to other pieces of data (it would point to the file in the lots-of-small-files case, and point to the data's location within the file in the large-files case).
There are a lot of assumptions here but, for all intents and purposes, searching through one large file will be much quicker than searching through a bunch of small files.
Let's say you are looking for a string of text contained in a text file. Searching a single 1 TB file will be much faster than opening 1,000,000 files of 1 MB each and searching through those.
Each file-open operation takes time. A large file only has to be opened once.
And, in considering disk performance, a single file is much more likely to be stored contiguously than a large series of files.
...Again, these are generalizations without knowing more about your specific application.
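To make the open/close overhead concrete, here is a rough C timing sketch; the file names (data/00000.bin and so on, big.bin) and the 4 KB record size are invented for the illustration, not anything from the question. It simply reads the same amount of data twice, once with one open per record and once with a single open plus seeks.

#include <stdio.h>
#include <time.h>

#define RECORDS 10000
#define RECSIZE 4096

static double seconds(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    static char buf[RECSIZE];
    char path[64];
    struct timespec t0, t1;

    /* Many small files: one open/read/close per record. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < RECORDS; i++) {
        snprintf(path, sizeof path, "data/%05d.bin", i);
        FILE *f = fopen(path, "rb");
        if (!f) { perror(path); return 1; }
        fread(buf, 1, RECSIZE, f);
        fclose(f);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("many small files: %.3f s\n", seconds(t0, t1));

    /* One big file: open once, then seek to each record. */
    FILE *big = fopen("big.bin", "rb");
    if (!big) { perror("big.bin"); return 1; }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < RECORDS; i++) {
        fseek(big, (long)i * RECSIZE, SEEK_SET);
        fread(buf, 1, RECSIZE, big);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("one big file:     %.3f s\n", seconds(t0, t1));
    fclose(big);
    return 0;
}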
It depends, really. Different filesystems are optimized in different ways, but in general small files are packed efficiently. The advantage of having large files is that you don't have to open and close a lot of them; open and close are operations that take time. If you have a large file, you normally open and close it only once and use seek operations to move around inside it.
If you go for the lots-of-files solution, I suggest a structure like
b/a/bar
b/a/baz
f/o/foo
because you have limits on the number of files in a directory.
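A minimal C sketch of how that layout could be generated, assuming names of at least two characters and an invented "store" root directory (not part of the original suggestion):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Writes e.g. "store/b/a/bar" into out for the name "bar". */
static void shard_path(const char *name, char *out, size_t outlen)
{
    snprintf(out, outlen, "store/%c/%c/%s", name[0], name[1], name);
}

int main(void)
{
    const char *name = "bar";
    char dir1[64], dir2[64], path[256];

    snprintf(dir1, sizeof dir1, "store/%c", name[0]);
    snprintf(dir2, sizeof dir2, "store/%c/%c", name[0], name[1]);
    shard_path(name, path, sizeof path);

    /* Create the intermediate directories; EEXIST errors ignored for brevity. */
    mkdir("store", 0755);
    mkdir(dir1, 0755);
    mkdir(dir2, 0755);

    FILE *f = fopen(path, "wb");
    if (f) {
        fputs("hello\n", f);
        fclose(f);
    }
    printf("stored under %s\n", path);
    return 0;
}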
The main issue here IMO is about indexing. If you're going to search for information in a huge file without a good index, you'll have to scan the whole file for the correct information, which can take a long time. If you think you can build strong indexing mechanisms, then fine, you should go with the huge file.
I'd prefer to delegate this task to ext3, which should be rather good at it.
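For illustration, a tiny sketch of what the "huge file plus index" idea could look like: the index maps each key to an (offset, length) pair inside one big data file, so a lookup is one seek and one read. The file name records.dat and the hard-coded table are made up for the example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct index_entry {
    char key[32];
    long offset;   /* byte offset inside records.dat */
    long length;   /* record length in bytes */
};

/* In a real system this would be loaded from disk or built at startup. */
static struct index_entry index_table[] = {
    { "foo", 0,   120 },
    { "bar", 120, 300 },
    { "baz", 420, 75  },
};

/* Returns a malloc'ed copy of the record for key, or NULL if not found. */
static char *fetch(FILE *data, const char *key)
{
    for (size_t i = 0; i < sizeof index_table / sizeof index_table[0]; i++) {
        if (strcmp(index_table[i].key, key) == 0) {
            char *buf = malloc(index_table[i].length + 1);
            if (!buf) return NULL;
            fseek(data, index_table[i].offset, SEEK_SET);
            fread(buf, 1, index_table[i].length, data);
            buf[index_table[i].length] = '\0';
            return buf;   /* caller frees */
        }
    }
    return NULL;
}

int main(void)
{
    FILE *data = fopen("records.dat", "rb");
    if (!data) { perror("records.dat"); return 1; }
    char *rec = fetch(data, "bar");
    if (rec) { printf("%s\n", rec); free(rec); }
    fclose(data);
    return 0;
}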
Edit:
A thing to consider, according to this Wikipedia article on ext3, is that fragmentation does happen over time. So if you have a huge number of small files which take up a significant percentage of the file system, you will lose performance over time.
The article also validates the claim about the 32k files per directory limit (assuming a Wikipedia article can validate anything).
I believe Ext3 has a limit of about 32000 files/subdirectories per directory. If you're going the millions of files route, you'll need to spread them throughout many directories. I don't know what that would do to performance.
My preference would be for the several large files. In fact, why have several at all, unless they're some kind of logically-separate units? If you're still splitting it up just for the sake of splitting it, I say don't do that. Ext3 can handle very large files just fine.
I work with a system that stores up to about 5 million files on an XFS file system under Linux and haven't had any performance problems. We only use the files for storing the data; we never full-scan them. We have a database for searching, and one of the fields in a table contains a guid which we use to retrieve the file. We use exactly two levels of directories as above, with the filenames being the guid, though more levels could be used if the number of files got even larger. We chose this approach to avoid storing a few extra terabytes in the database that only needed to be stored/returned and never searched through, and it has worked well for us. Our files range from 1 KB to about 500 KB.
We have also run the system on ext3, and it functioned fine, though I'm not sure if we ever pushed it past about a million files. We'd probably need to go to a three-level directory scheme due to the maximum-files-per-directory limitations.
Related
I have millions of files in one directory (one directory with many child directories);
these files are all small files.
I think there are two challenges:
How to traverse the directory to find all the files. I have tried the FindFirstFile/FindNextFile way, but I feel it is too slow. Should I use the Windows Change Journal?
After I have found all the filenames, I need to read each whole file into memory and then parse it. Should I use the FILE_FLAG_SEQUENTIAL_SCAN flag, or is there a more efficient way?
Some ideas to kick around..
Text Crawler - Awesome indispensable Windows Search Tool - http://digitalvolcano.co.uk/textcrawler.html
Microsoft log parser - http://technet.microsoft.com/en-us/scriptcenter/dd919274.aspx
If you have a SQL (or MySQL) server that has enough space, you can set up a SQL job to import or link to the files in question, and then you can query them.
My fear is that if you load the contents of the files into memory, you are going to run out of server memory quickly. What you need to do is locate the files in question and write the results to a log or report that you can parse and interpret.
NTFS, or in fact any non-specialized file system will be slow with millions of small files. That's the territory of databases.
If the files are in fact small, it doesn't matter at all how you read them. Overhead costs will dominate. It may be worthwhile to use a second thread, but a third thread is unlikely to help further.
Also, use FindFirstFileEx to speed up the search. You don't need the alternate (8.3) file names, but you would prefer a larger buffer.
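For reference, a hedged sketch of that suggestion: FindExInfoBasic skips the alternate (8.3) names and FIND_FIRST_EX_LARGE_FETCH requests a larger enumeration buffer (both need Windows 7 or later). The directory path is just a placeholder.

#define _WIN32_WINNT 0x0601
#include <windows.h>
#include <stdio.h>

int main(void)
{
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileExW(L"C:\\data\\*",
                                FindExInfoBasic,        /* skip 8.3 names */
                                &fd,
                                FindExSearchNameMatch,
                                NULL,
                                FIND_FIRST_EX_LARGE_FETCH);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "FindFirstFileExW failed: %lu\n", GetLastError());
        return 1;
    }
    do {
        if (!(fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
            wprintf(L"%ls\n", fd.cFileName);
    } while (FindNextFileW(h, &fd));
    FindClose(h);
    return 0;
}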
You can use NtQueryDirectoryFile with a large buffer (say, 64 KB) to query for the children.
This function is the absolute limit to the fastest you can possibly communicate with the file system.
If that doesn't work for you, you can read the NTFS file table directly, but that means you'll have to have administrative privileges and will need to implement the file system reader by hand.
I am looking to store many (500 million - 900 million) small (2kB-9kB) files.
Most general purpose filesystems seem unfit for this as they are either unable to handle the sheer number of files, slow down with many files or have exceedingly large block sizes.
This seems to be a common problem; however, all the solutions I could find seem to end up just accepting a hit to storage efficiency when storing small files in allocation units roughly the same size as the files themselves.
Thus:
Are there any filesystems specifically designed to handle hundreds of millions of small files?
or
Is there a production level solution for archiving the small files on the fly and writing one large file to disk?
Our SolFS supports page sizes as small as 512 bytes and lets you create a virtual file system inside a single file, thus combining all of your files into one storage file. Performance, though, depends on how the files are stored (hierarchically or in one folder) and is in general specific to usage scenarios.
I'm working on a system that will need to store a lot of documents (PDFs, Word files, etc.). I'm using Solr/Lucene to search for relevant information extracted from those documents, but I also need a place to store the original files so that they can be opened/downloaded by the users.
I was thinking about several possibilities:
file system - probably not that good an idea to store 1M documents
sql database - but I won't need most of its relational features, as I need to store only the binary document and its id, so this might not be the fastest solution
no-sql database - I don't have any experience with them, so I'm not sure if they are any good either; there are also many of them, so I don't know which one to pick
The storage I'm looking for should be:
fast
scalable
open-source (not crucial but nice to have)
Can you recommend what, in your opinion, would be the best way of storing those files?
A filesystem -- as the name suggests -- is designed and optimised to store large numbers of files in an efficient and scalable way.
You can follow Facebook as it stores a lot of files (15 billion photos):
They initially started with an NFS share served by commercial storage appliances.
Then they moved to their own implementation, an HTTP file server called Haystack.
Here is a facebook note if you want to learn more http://www.facebook.com/note.php?note_id=76191543919
Regarding the NFS share: keep in mind that NFS shares usually limit the number of files in one folder for performance reasons. (This could be a bit counter-intuitive if you assume that all recent file systems use B-trees to store their structure.) So if you are using commercial NFS shares (like NetApp), you will likely need to keep files in multiple folders.
You can do that if you have any kind of id for your files. Just divide its ASCII representation into groups of a few characters and make a folder for each group.
For example we use integers for ids so file with id 1234567891 is stored as storage/0012/3456/7891.
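A tiny C sketch of that id-to-path mapping (zero-pad the id to 12 digits, then split it into groups of four); the exact padding width and group size are assumptions based on the example above.

#include <stdio.h>

/* 1234567891 -> "storage/0012/3456/7891" */
static void id_to_path(long long id, char *out, size_t outlen)
{
    char digits[32];
    snprintf(digits, sizeof digits, "%012lld", id);
    snprintf(out, outlen, "storage/%.4s/%.4s/%.4s",
             digits, digits + 4, digits + 8);
}

int main(void)
{
    char path[64];
    id_to_path(1234567891LL, path, sizeof path);
    printf("%s\n", path);   /* prints storage/0012/3456/7891 */
    return 0;
}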
Hope that helps.
In my opinion...
I would store files compressed onto disk (file system) and use a database to keep track of them.
and possibly use SQLite if this is its only job.
File system: thinking about the big picture, the DBMS uses the file system underneath anyway. And the file system is dedicated to keeping files, so you benefit from its optimizations (as LukeH mentioned).
Let's say that I have an app that lets you browse through a listing of cars found in a SQLite database. When you click on a car in the listing, it'll open up a view with the description of the car and a photo of the car.
My question is: should I keep the photo in the database as a binary data column in the row for this specific car, or should I have the photo somewhere in the resources directory? Which is better to do? Are there any limitations of Sqlite in terms of how big a binary data column can be?
The database will pretty much be read-only and bundled with the app (so the user wouldn't be inserting any cars and their photos).
This is a decision which is discussed quite a lot. And in my opinion, it's a matter of personal taste. Pretty much like vim/emacs, windows/linux kind of debates. Not that heated though.
Both sides have their advantages and disadvantages. When you store them in a database, you don't need to worry much about filename and location. Management is also easier (you delete the row(s) containing the BLOBs, and that's it). But the files are also harder to access and you may need to write wrapper code in one way or the other (f.ex. some of those "download.php" links).
On the other hand, if the binary data is stored as separate files, management is more complicated (you need to open the proper file from disk by constructing the filename first). On large data sets you may run into filesystem bottlenecks when the number of files in one directory grows very large (but this can easily be prevented by creating subdirectories). But then, if the data is stored as files, replacing them becomes so much easier. Other people can also access it without needing to know the internals (imagine, for example, a user who enjoys customizing his/her UI).
I'm certain that there are other points to be raised, but I don't want to write too much now...
I would say: Think about what operations you would like to do with the photos (and the limitations of both storage methods), and from there take an informed decision. There's not much that can go wrong.
TX-Log and FS-Journal
On further investigation I found some more info:
SQLite uses a Transaction Log
Android uses YAFFS on the system mount points and VFAT on the SD card. Both are (to the best of my knowledge) unjournaled.
I don't know the exact implementation of SQLite's TX-Log, but it is to be expected that each INSERT/UPDATE operation will perform two writes on the disk. I may be mistaken (it largely depends on the implementation of transactions), but frankly, I couldn't be bothered to skim through the SQLite source code. It feels to me like we start splitting hairs here (premature optimization anyone?)...
As both file systems (YAFFS and VFAT) are not journalled, you have no additional "hidden" write operations.
These two points speak in favour of the file system.
Note that this info is to be taken with a grain of salt. I only skimmed over the Google results of YAFFS journaling and sqlite transaction log. I may have missed some details.
The default MAX SQLite size for a BLOB or String is 2^31-1 bytes. That value also applies to the maximum bytes to be stored in a ROW.
As to which method is better, I do not know. What I would do in your case is test both methods and monitor the memory usage and its effect on battery life. Maybe you'll find the filesystem method has a clear advantage over the other in that area and the ease of storing it in SQLite is not worth it for your use case, or maybe you'll find the opposite.
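If it helps, here is a hedged sketch of the BLOB route using the SQLite C API; the table and column names (cars, photo, id) are taken from the question's scenario and are assumptions, not a required schema.

#include <sqlite3.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns a malloc'ed copy of the photo BLOB for the given car id,
 * or NULL if not found. *size receives the number of bytes. */
static unsigned char *load_photo(sqlite3 *db, int car_id, int *size)
{
    sqlite3_stmt *stmt;
    unsigned char *copy = NULL;

    if (sqlite3_prepare_v2(db, "SELECT photo FROM cars WHERE id = ?",
                           -1, &stmt, NULL) != SQLITE_OK)
        return NULL;
    sqlite3_bind_int(stmt, 1, car_id);

    if (sqlite3_step(stmt) == SQLITE_ROW) {
        const void *blob = sqlite3_column_blob(stmt, 0);
        *size = sqlite3_column_bytes(stmt, 0);
        copy = malloc(*size);
        if (copy) memcpy(copy, blob, *size);
    }
    sqlite3_finalize(stmt);
    return copy;
}

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("cars.db", &db) != SQLITE_OK) return 1;
    int size = 0;
    unsigned char *photo = load_photo(db, 42, &size);
    if (photo) {
        printf("photo is %d bytes\n", size);
        free(photo);
    }
    sqlite3_close(db);
    return 0;
}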
I've been thinking on this for a while now (you know, that dangerous thing programmers tend to do) and I've been wondering, is the method of storing data that we're so accustomed to really all that efficient? The trouble with answering this question is that I really don't have anything to compare it to, since it's the only thing I've ever used.
I don't mean FAT or NTFS or a particular type of file system, I mean the filesystem structure as a whole. We are simply used to thinking of "files" inside "folders" like our hard drive was one giant filing cabinet. This is a great analogy and indeed, it makes it a lot easier to learn when we think of it this way, but is it really the best way to go about describing programs and their respective parts?
I'd like to know if anyone can think of (or knows about) a data storage technique that might be used to store data for an Operating System to use that would organize the parts of data in a different manner. Does anything... different even exist?
Emails are often stored in folders. But ever since I have migrated to Gmail, I have become accustomed to classifying my emails with tags.
I often wondered if we could manage a whole file-system that way: instead of storing files in folders, you could tag files with the tags you like. A file identifier would not look like this:
/home/john/personal/contacts.txt
but more like this:
contacts[john,personal]
Well... just food for thought (maybe this already exists!)
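Just to make the idea concrete, here is one way such a lookup could be emulated today with a small tag index, sketched in C with an in-memory SQLite table; all the names are invented for the example.

#include <sqlite3.h>
#include <stdio.h>

static int print_row(void *unused, int ncols, char **vals, char **names)
{
    (void)unused; (void)ncols; (void)names;
    printf("%s\n", vals[0]);
    return 0;
}

int main(void)
{
    sqlite3 *db;
    sqlite3_open(":memory:", &db);
    sqlite3_exec(db,
        "CREATE TABLE file_tags(file TEXT, tag TEXT);"
        "INSERT INTO file_tags VALUES"
        " ('contacts','john'), ('contacts','personal'),"
        " ('budget','john'),   ('budget','work');",
        NULL, NULL, NULL);

    /* 'Open the file tagged both john and personal' becomes a query. */
    sqlite3_exec(db,
        "SELECT file FROM file_tags WHERE tag IN ('john','personal')"
        " GROUP BY file HAVING COUNT(DISTINCT tag) = 2;",
        print_row, NULL, NULL);   /* prints: contacts */

    sqlite3_close(db);
    return 0;
}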
You can, for example, have dedicated solutions, like Oracle Raw Partitions. Other databases support similar things. In these cases the filesystem provides unnecessary overhead and can be omitted - the DB software will take care of organising the structure.
The problem seems very application dependent and files/folders seem to be a reasonable compromise for many applications (and is easy for human beings to comprehend).
Mainframes used to just give programmers a number of 'devices' to use. A device corresponded to a drive or a partition thereof, and the programmer was responsible for the organisation of all data on it. Of course they quickly built up libraries to help with that.
The only OS I can think of that doesn't use the common hierarchical arrangement of flat files (like UNIX) is PICK. It used a sort of relational database as the filesystem.
Microsoft had originally planned to introduce a new file system for Windows Vista (WinFS, Windows Future Storage). The idea was to store everything in a relational database (SQL Server). As far as I know, this project was never (or not yet?) finished.
There's more information about it on wikipedia.
I knew a guy who wrote his doctorate about a hard disk that comes with its own file system. It was based on an extension of SCSI commands that allowed the usual open, read, write and close commands to be sent to the disk directly, bypassing the file system drivers of the OS. I think the conclusion was that it is inflexible, and does not add much efficiency.
Anyway, this disk-based file system still had a folder-like structure, I believe, so I don't think it really counts for you ;-)
Well, there's always Pick, where the OS and file system were an integrated database.
Traditional file systems are optimized for fast file access if you know the name of the file you want (including its path). Directories are a way of grouping files together so that they're easier to find if you know properties of the file but not its actual name.
Traditional file systems are not good at finding files if you know very little about them, however they are robust enough that one can add a layer on top of them to aid in retrieving files based on content or meta-information such as tags. That's what indexers are for.
The bottom line is we need a way to store persistently the bytes that the CPU needs to execute. So we have traditional file systems which are very good at organizing sequential sets of bytes. We also need to store persistently the bytes of files that aren't executed directly, but are used by things that do execute. Why create a new system for the same fundamental thing?
What more should a file system do other than store and retrieve bytes?
I'll echo the other responses. If I could pick a filesystem type, I personally would rather see a hybrid approach: a flat database of subtrees, where each subtree is treated as a cohesive unit; the subtrees themselves would have no hierarchy among them, but instead could carry metadata and be queryable on that metadata.
The reason for files is that humans like to attach names to "things" they have to use. Otherwise, it becomes hard to talk or think about or even distinguish them.
When we have too many things on a heap, we like to separate the heap. We sort it by some means and we like to build hierarchies where you can navigate arbitrarily sized amounts of things.
Hence directories and files just map our natural way of working with real objects, since you can put anything in a file. On Unix, even hardware is mapped as "device nodes" into the file system: special files which you can read/write to send commands to the hardware.
I think the metaphor is so powerful, it will stay.
I spent a while trying to come up with an automagically versioning file system that would maintain versions (and version history) of any specific file and/or directory structure.
The idea was that all of the standard access commands (e.g. dir, read, etc.) would have an optional date/time parameter that could be passed to access the file system as it looked at that point in time.
I got pretty far with it, but had to abandon it when I had to actually go out and earn some money. It's been on the back-burner since then.
If you take a look at the start-up times for operating systems, it should be clear that improvements in accessing disks can be made. I'm not sure if the changes should be in the file system or rather in the OS start-up code.
Personally, I'm really sorry WinFS didn't fly. I loved the concept.
From Wikipedia (http://en.wikipedia.org/wiki/WinFS):
WinFS includes a relational database for storage of information, and allows any type of information to be stored in it, provided there is a well defined schema for the type. Individual data items could then be related together by relationships, which are either inferred by the system based on certain attributes or explicitly stated by the user. As the data has a well defined schema, any application can reuse the data; and using the relationships, related data can be effectively organized as well as retrieved. Because the system knows the structure and intent of the information, it can be used to make complex queries that enable advanced searching through the data and aggregating various data items by exploiting the relationships between them.