hash indexed directories

I was just going through the minix filesystem when I got curious about the various filesystems out there. On reading about the features supported by ext2, I saw that hash indexed directories are among them.
Could someone enlighten me as to what this is?

See the link below for details on Ext2/3 Htree directory indexing:
http://ext2.sourceforge.net/2005-ols/paper-html/node3.html
Basically the filesystem uses a hash tree to store directory entries, rather than a linear list. That results in a significant performance improvement in directory lookup operations.
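To make the difference from a linear list concrete, here is a toy sketch in Python. It is not the Ext2/3 on-disk format: the class name `HashedDirectory` is made up for illustration, `zlib.crc32` stands in for whatever hash the filesystem really uses, and a sorted in-memory list stands in for the tree of directory blocks.

```python
import bisect
import zlib


class HashedDirectory:
    """Toy directory index: entries are kept sorted by the hash of the name."""

    def __init__(self):
        self._entries = []  # sorted list of (name_hash, name, inode)

    def add(self, name, inode):
        name_hash = zlib.crc32(name.encode())
        bisect.insort(self._entries, (name_hash, name, inode))

    def lookup(self, name):
        name_hash = zlib.crc32(name.encode())
        # Binary search jumps straight to the first entry with this hash,
        # instead of comparing the name against every entry in turn.
        i = bisect.bisect_left(self._entries, (name_hash,))
        while i < len(self._entries) and self._entries[i][0] == name_hash:
            if self._entries[i][1] == name:  # resolve hash collisions by name
                return self._entries[i][2]
            i += 1
        return None
```

A lookup in this sketch costs O(log n) comparisons, where a linear directory scan costs O(n).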

Related

Storing a large hash on HDD

I am trying to store a simple, large hash table (64-bit key, 64-bit value), about 80 GB in size, on a hard drive. What is the most efficient way to do it if I want the best performance?
The keys to look up are totally random, and I have to do a lookup every 10 ms. Is there an abstraction available as a C/Linux library which can map/hash the key to a Logical Block Address on the HDD so that access will be faster?
Please give some guidelines.

You might use a memory mapped file (mmap), and then arrange your data in such a way that you would only read one page for each lookup. This could be done by having all keys sorted in the file, and then having an in-memory index that holds the first key of each page.
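A minimal sketch of that layout, written in Python for brevity (the same structure maps directly onto `mmap(2)` in C). The 4096-byte page size, the little-endian record layout, and the function names `build` and `lookup` are assumptions for illustration; keys are assumed to be unique.

```python
import bisect
import struct

RECORD = struct.Struct("<QQ")        # 64-bit key, 64-bit value (assumed layout)
PAGE_SIZE = 4096                     # assumed page size
PER_PAGE = PAGE_SIZE // RECORD.size  # 256 records fit in one page


def build(path, pairs):
    """Write the records sorted by key; return the first key of every page."""
    pairs = sorted(pairs)
    with open(path, "wb") as f:
        for key, value in pairs:
            f.write(RECORD.pack(key, value))
    return [pairs[i][0] for i in range(0, len(pairs), PER_PAGE)]


def lookup(mapped, first_keys, key):
    """Find key while touching only one page of the mapped file."""
    page = bisect.bisect_right(first_keys, key) - 1  # in-memory index search
    if page < 0:
        return None
    start = page * PAGE_SIZE
    end = min(start + PAGE_SIZE, len(mapped))
    lo, hi = 0, (end - start) // RECORD.size
    while lo < hi:  # binary search inside the single page
        mid = (lo + hi) // 2
        k, v = RECORD.unpack_from(mapped, start + mid * RECORD.size)
        if k == key:
            return v
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Here `mapped` would be an `mmap.mmap` object opened over the file written by `build`. For 80 GB of 16-byte records the index holds one 8-byte key per 4 KB page, so it is roughly 1/512 of the data size.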

Rely on the file system to do the work, and use the hash to form a file system path & file name. E.g., at 64 bits, assume your key, in 16 hex characters, is
5a5bf28dcd794499
Store that hash's value in file \5a\5b\f2\8d\cd\79\44\99.txt
This scheme only loads each subdirectory with a max of 256 folders/files. Git does this, but only goes one directory deep, probably assuming (reasonably) that you won't commit billions of files to your git store.
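The mapping from key to path can be sketched as follows (Python; the function name `key_to_path` and the `store` root directory are made up for illustration, and forward slashes are used instead of backslashes):

```python
from pathlib import PurePosixPath


def key_to_path(key, root="store"):
    """Map a 64-bit key to root/xx/xx/xx/xx/xx/xx/xx/xx.txt."""
    if not 0 <= key < 2 ** 64:
        raise ValueError("key must fit in 64 bits")
    hex_key = format(key, "016x")  # 16 hex characters, zero padded
    parts = [hex_key[i:i + 2] for i in range(0, 16, 2)]
    return PurePosixPath(root, *parts[:-1], parts[-1] + ".txt")
```

For the key above, `key_to_path(0x5a5bf28dcd794499)` gives `store/5a/5b/f2/8d/cd/79/44/99.txt`.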

Some general guidelines:
Use open addressing with linear probing to resolve collisions. Because the probes visit adjacent slots, most lookups should then need only a single HDD access.
On a 64-bit system, try mmaping the file for better cache performance.
It might help to create a separate partition and access it directly via /dev/sd??.
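A rough sketch of the first two guidelines together, in Python rather than C. The class name `DiskTable`, the slot layout, and the convention that key 0 marks an empty slot are all assumptions for illustration; since the keys are said to be random, `key % slots` is used as the hash.

```python
import mmap
import struct

SLOT = struct.Struct("<QQ")  # key, value (assumed layout)
EMPTY = 0                    # assumption: key 0 is reserved to mean "empty slot"


class DiskTable:
    """Open-addressing hash table with linear probing over a mapped file."""

    def __init__(self, path, slots):
        self.slots = slots
        with open(path, "wb") as f:
            f.truncate(slots * SLOT.size)  # zero-filled file: every slot empty
        self._file = open(path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), 0)

    def _probe(self, key):
        start = key % self.slots
        for n in range(self.slots):  # neighbouring slots share a disk page
            yield ((start + n) % self.slots) * SLOT.size

    def put(self, key, value):
        if key == EMPTY:
            raise ValueError("key 0 is reserved")
        for offset in self._probe(key):
            stored, _ = SLOT.unpack_from(self._map, offset)
            if stored in (EMPTY, key):
                SLOT.pack_into(self._map, offset, key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key):
        for offset in self._probe(key):
            stored, value = SLOT.unpack_from(self._map, offset)
            if stored == key:
                return value
            if stored == EMPTY:
                return None
        return None

    def close(self):
        self._map.close()
        self._file.close()
```

Deletion is left out; with linear probing it needs tombstones or re-insertion of the following run of slots.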

Selecting a file randomly from a file system

This question relates to Simulating file system access.
I need to choose files and directories randomly to act as arguments of file operations like rename, write, read etc. What I was planning to do was to make a list of all files and directories with their paths and randomly select from this list. But, as files and directories are created and deleted in the actual file system, the list also has to be updated. I am finding that maintaining and updating the list in this manner is inefficient, and it also has to be atomic so that a later operation does not access a file that was deleted by a previous operation.
Can you suggest a different way of selecting the files? Maybe there is some way to do it directly from the file system, but how would we know the paths to the files then?
Thanks
I found something interesting in Randomly selecting a file from a tree of directories in a completely fair manner, especially in Michael J. Barber's answer, but I am not able to follow it completely because I don't know Python.

You don't want to try to maintain a list of files when the filesystem is right there. You should be able to do it right from C. Walk from the root directory, selecting a random file from it. You can pick a random maximum depth, and if you hit a regular file, at or before this, use it. If it's a directory, repeat up to max depth. If it's a special file, maybe start over.
This should be pretty quick. The operation shouldn't have to be atomic. If the file's not there when you want to do your operation, try again. Shouldn't be too complicated. You can build the path up as you find your target file. This will be simpler than fussing with the fs directly (I assume you meant at a much lower level) and should be simple to implement.
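The walk described above can be sketched like this. It is shown in Python to keep it short; in C the same loop would use `opendir`/`readdir`/`stat`. The function name `random_file` and the fixed `max_depth` parameter are assumptions, and the selection is not uniform over all files: files near the root are favoured.

```python
import os
import random


def random_file(root, max_depth=8, rng=random):
    """Walk down from root, picking a random entry at each level.

    Returns the path of a regular file, or None if the walk hit an empty
    directory, a special file, a vanished entry, or ran out of depth.
    The caller is expected to simply try again on None.
    """
    path = root
    for _ in range(max_depth):
        try:
            entries = os.listdir(path)
        except OSError:  # directory was deleted or is unreadable
            return None
        if not entries:
            return None
        path = os.path.join(path, rng.choice(entries))
        if os.path.islink(path):
            return None  # do not follow symbolic links
        if os.path.isfile(path):
            return path
        if not os.path.isdir(path):
            return None  # special file: start over
    return None
```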

Here is my proposed solution. It is not the fastest, but should be quick (after preparation), use only modest memory, and be "fairly well-distributed". This is, of course, 100% untested and somewhat complex (as complex as maintaining an RB-tree or similar, anyway) -- I pity anyone having to use C ;-)
1. Build a tree with one node per directory in the target domain, using a depth-first walk of the filesystem, and record for each node the "before" file count (files found so far in the walk) and the "after" file count (the "before" count plus the number of files in that directory). "Fast way to find the number of files" gives some example C code. It still requires iterating over the directory contents but does not need to store the files themselves.
2. Count up the total number of files in the tree. (This should really just be the "after" count of the last node in the tree.)
3. Pick a random number in the range [0, max files).
4. Navigate to the node in the tree such that the "before" file count <= random number < "after" file count. This is just walking down the (RB-)tree structure and is itself O(lg n) or similar.
5. Pick a random file in the directory associated with the selected node -- make sure to count the directory again and use this as the [0, limit) range in the selection (with a fallback in case of running off the end due to concurrency issues). If the number of files changed, make sure to update the tree with such information. Also update/fix the tree if the directory has been deleted, etc. (The extra full count here shouldn't be as bad as it sounds, as readdir must, on average, already iterate through half the entries in the directory. However, the benefit of the re-count, if any, should be explored.)
6. Repeat steps 2-5 as needed.
7. Periodically rebuild the entire tree (step #1) to account for filesystem changes. Deleting/adding files will slowly skew the randomness -- step #5 can help to update the tree in certain situations. The frequency of rebuilds should be determined through experimentation. It might also be possible to reduce the error by rebuilding the parent/grandparent nodes or [random] child nodes on each pass, etc. Using the modified time as a fast way to detect changes may also be worth looking into.
Happy coding.

All you need to know is how many files are in each directory, in order to pick the directory you should traverse into. Avoid traversing symbolic links and counting the files behind them.
You can use a solution similar to the one pst described.
For example, say you have 3 directories containing 20, 40 and 1000 files.
You compute the running totals [20, 60, 1060] and pick a random number in the range [0, 1060). If this number is greater than or equal to 60, you go into the 3rd folder.
You stop traversing once you reach a folder without subfolders.
To find a random file along this path you can apply the same trick as before.
This way you will pick any file with equal probability.
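The running-totals trick from this example looks like this in code (Python; the function names are made up for illustration):

```python
import bisect
import itertools
import random


def running_totals(counts):
    """[20, 40, 1000] -> [20, 60, 1060]"""
    return list(itertools.accumulate(counts))


def index_for(totals, r):
    """Index of the directory whose range of file numbers contains r."""
    return bisect.bisect_right(totals, r)


def pick_directory(counts, rng=random):
    """Pick a directory index with probability proportional to its file count."""
    totals = running_totals(counts)
    return index_for(totals, rng.randrange(totals[-1]))
```

With totals `[20, 60, 1060]`, the numbers 0-19 select the first directory, 20-59 the second, and 60-1059 the third.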

How to implement B+ Tree for file systems?

I have a text file which contains some info on extents about all the files in the file system, like below
C:\Program Files\abcd.txt
12345 100
23456 200
C:\Program Files\bcde.txt
56789 50
26746 300
...
I have another binary which tries to find the extents for all the files.
Currently I am using a linear search to find the extent info for each file in the text file above. This is a time-consuming process. Is there a better way of coding this, such as implementing a good data structure like a B-tree? If a B+ tree is used, what key and branching factor should I use?

Use a database.
The key points in implementing a tree in a file are to have fixed record lengths and to use file offsets instead of pointers.
Use a database. Hmmm, SQLite.
Another point to consider with files is that reading in chunks of data is faster than reading individual items (regardless of whether or not the hard disk has a cache or the OS has a cache). I implemented a B+Tree, which uses pages as its nodes.
Use a database. Databases have already been written and tested.
A more efficient design is to keep the initial node in memory. This reduces the number of fetches from the file. If your program has the space, keeping the first couple of levels in memory may also speed up execution.
Use a database.
I gave up writing a B-Tree implementation for my application because I wanted to concentrate on the other functionality of the program. I later learned that in the real world (the world where programs need to be finished on a schedule) that time should be spent on the 'core' of the application rather than accessories that have already been written and tested (a.k.a. off-the-shelf).
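To illustrate the "fixed record lengths and file offsets instead of pointers" point, here is a deliberately simplified sketch in Python. It is a plain unbalanced binary search tree, not a B+ tree, and the record layout and function names are assumptions; a real B+ tree would pack many keys into each page-sized node.

```python
import io
import struct

NODE = struct.Struct("<qqq")  # key, left child offset, right child offset
NIL = -1                      # plays the role of a null pointer


def _append(f, key):
    """Append a new leaf record and return its file offset."""
    offset = f.seek(0, io.SEEK_END)
    f.write(NODE.pack(key, NIL, NIL))
    return offset


def _read(f, offset):
    f.seek(offset)
    return NODE.unpack(f.read(NODE.size))


def insert(f, key):
    if f.seek(0, io.SEEK_END) == 0:
        _append(f, key)  # the root always lives at offset 0
        return
    offset = 0
    while True:
        k, left, right = _read(f, offset)
        if key == k:
            return
        child = left if key < k else right
        if child == NIL:
            new = _append(f, key)
            if key < k:
                left = new
            else:
                right = new
            f.seek(offset)  # rewrite the parent record in place
            f.write(NODE.pack(k, left, right))
            return
        offset = child


def contains(f, key):
    if f.seek(0, io.SEEK_END) == 0:
        return False
    offset = 0
    while offset != NIL:
        k, left, right = _read(f, offset)
        if key == k:
            return True
        offset = left if key < k else right
    return False
```

`f` can be any seekable binary file opened for reading and writing, or an `io.BytesIO` for experiments. Because every record has the same size, a node can be rewritten in place without shifting the rest of the file.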

It depends on how you want to search your file. I assume that you want to look up your info given a file name. Then a hash table or a trie would be a good data structure to use.
The B-tree is possible but not the most convenient choice given that your keys are strings.

One large file or multiple small files?

I have an application (currently written in Python as we iron out the specifics but eventually it will be written in C) that makes use of individual records stored in plain text files. We can't use a database and new records will need to be manually added regularly.
My question is this: would it be faster to have a single file (500 KB-1 MB) and have my application open it, loop through it, find the record and close it, OR would it be faster to have the records separated and named using some appropriate convention so that the application could simply loop over filenames to find the data it needs?
I know my question is quite general, so pointers to any good articles on the topic are appreciated as much as suggestions.
Thanks very much in advance for your time,
Dan

Essentially your second approach is an index - it's just that you're building your index in the filesystem itself. There's nothing inherently wrong with this, and as long as you arrange things so that you don't get too many files in the one directory, it will be plenty fast.
You can achieve the "don't put too many files in the one directory" goal by using multiple levels of directories - for example, the record with key FOOBAR might be stored in data/F/FO/FOOBAR rather than just data/FOOBAR.
Alternatively, you can make the single-large-file perform as well by building an index file, that contains a (sorted) list of key-offset pairs. Where the directories-as-index approach falls down is when you want to search on key different from the one you used to create the filenames - if you've used an index file, then you can just create a second index for this situation.
You may want to reconsider the "we can't use a database" restriction, since you are effectively just building your own database anyway.
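A small sketch of the index-file idea for the single large file, in Python since that is what the prototype uses. The record format (one record per line, key and body separated by a tab) and the function names are assumptions for illustration; the index is kept in memory here, but the same sorted (key, offset) pairs could be written to a separate index file.

```python
import bisect


def build_index(f):
    """Scan the data file once; return a sorted list of (key, offset) pairs."""
    index = []
    f.seek(0)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        key = line.split(b"\t", 1)[0]
        index.append((key, offset))
    index.sort()
    return index


def find(f, index, key):
    """Binary-search the index, then seek straight to the record."""
    i = bisect.bisect_left(index, (key, 0))
    if i == len(index) or index[i][0] != key:
        return None
    f.seek(index[i][1])
    return f.readline().rstrip(b"\n").split(b"\t", 1)[1]
```

`f` is the data file opened in binary mode. A second index on a different field can be built the same way without touching the data file.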

Reading a directory is in general more costly than reading a file. But if you can find the file you want without reading the directory (i.e. not "loop over filenames" but "construct a file name") thanks to your naming convention, it may be beneficial to split your database.

Given your data is 1 MB, I would even consider to store it entirely in memory.
To give you some clue about your question, I'd consider that having one single big file means that your application is doing the management of the lines. Having multiple small files means relying on the system and the filesystem to manage the data. The latter can be quite slow though, because it involves system calls for all your operations.

Opening and closing files in C takes time.
For example, if you have 500 files of 2 KB each and you process them all, 1000 additional operations are added to your application (500 opens and 500 closes), while a single file of 1 MB would save you those 1000 additional operations. (That is purely my personal opinion.)

Generally it's better to have multiple small files. Keeps memory usage low and performance is much better when searching through it.
But it depends on the amount of operations you'll need, because filesystem calls are much more expensive when compared to memory storage for instance.

This all depends on your file system, block size and memory cache among others.
As usual, measure and find out if this is a real problem since premature optimization should be avoided. It may be that using one file vs many small files does not matter much for performance in practice and that the choice should be based on clarity and maintainability instead.
(What I can say for certain is that you should not resort to a linear file search; use a naming convention to pinpoint the file in O(1) time instead.)

The general trade-off is that having one big file can be more difficult to update, but having lots of little files is fiddly. If you use multiple files and end up with a lot of them, traversing a directory with a million files in it can get very slow. If possible, break the files into some sort of grouping so they can be put into separate directories and "keyed". I have an application that requires the creation of lots of little pdf documents for all users of the system. If we put these in one directory it would be a nightmare, but having a directory per user id makes it much more manageable.

Why can't you use a DB, I'm curious? I respect your preference, but just want to make sure it's for the right reason.
Not all DBs require a server to connect to or complex deployment. SQLite, for instance, can be easily embedded in your application. Python already has it built-in, and it's very easy to connect with C code (SQLite itself is written in C and its primary API is for C). SQLite manages a feature-complete DB in a single file on the disk, where you can create multiple tables and use all the other nice features of a DB.
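As a rough idea of how little code the embedded route takes, here is a sketch using Python's built-in `sqlite3` module. The table name `records`, its two columns, and the helper function names are assumptions for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a single-file SQLite database holding the records."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "key TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return db


def put(db, key, body):
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO records (key, body) VALUES (?, ?)",
            (key, body),
        )


def get(db, key):
    row = db.execute(
        "SELECT body FROM records WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None
```

Passing a file name instead of `":memory:"` keeps everything in one file on disk, and the primary key gives indexed lookups without any hand-written index.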

Fastest file access/storage?

I have about 750,000,000 files I need to store on disk. What's more is I need to be able to access these files randomly--any given file at any time--in the shortest time possible. What do I need to do to make accessing these files fastest?
Think of it like a hash table, only the hash keys are the filenames and the associated values are the files' data.
A coworker said to organize them into directories like this: if I want to store a file named "foobar.txt" and it's stored on the D: drive, put the file in "D:\f\o\o\b\a\r.\t\x\t". He couldn't explain why this was a good idea though. Is there anything to this idea?
Any ideas?
The crux of this is finding a file. What's the fastest way to find a file by name to open?
EDIT:
I have no control over the file system upon which this data is stored. It's going to be NTFS or FAT32.
Storing the file data in a database is not an option.
Files are going to be very small--maximum of probably 1 kb.
The drives are going to be solid state.
Data access is virtually random, but I could probably figure out a priority for each file based on how often it is requested. Some files will be accessed much more than others.
Items will constantly be added, and sometimes deleted.
It would be impractical to consolidate multiple files into single files because there's no logical association between files.
I would love to gather some metrics by running tests on this stuff, but that endeavour could become as consuming as the project itself!
EDIT2:
I want to upvote several thorough answers, whether they're spot-on or not, and cannot because of my newbie status. Sorry guys!

This sounds like it's going to be largely a question of filesystem choice. One option to look at might be ZFS, it's designed for high volume applications.
You may also want to consider using a relational database for this sort of thing. 750 million rows is sort of a medium size database, so any robust DBMS (eg. PostgreSQL) would be able to handle it well. You can store arbitrary blobs in the database too, so whatever you were going to store in the files on disk you can just store in the database itself.
Update: Your additional information is certainly helpful. Given a choice between FAT32 and NTFS, then definitely choose NTFS. Don't store too many files in a single directory, 100,000 might be an upper limit to consider (although you will have to experiment, there's no hard and fast rule). Your friend's suggestion of a new directory for every letter is probably too much, you might consider breaking it up on every four letters or something. The best value to choose depends on the shape of your dataset.
The reason breaking up the name is a good idea is that typically the performance of filesystems decreases as the number of files in a directory increases. This depends highly on the filesystem in use, for example FAT32 will be horrible with probably only a few thousand files per directory. You don't want to break up the filenames too much, so you will minimise the number of directory lookups the filesystem will have to do.

That file algorithm will work, but it's not optimal. I would think that using 2 or 3 character "segments" would be better for performance - especially when you start considering doing backups.
For example:
d:\storage\fo\ob\ar\foobar.txt
or
d:\storage\foo\bar\foobar.txt
There are some benefits to using this sort of algorithm:
No database access is necessary.
Files will be spread out across many directories. If you don't spread them out, you'll hit severe performance problems. (I vaguely recall hearing about someone having issues at ~40,000 files in a single folder, but I'm not confident in that number.)
There's no need to search for a file. You can figure out exactly where a file will be from the file name.
Simplicity. You can very easily port this algorithm to just about any language.
There are some down-sides to this too:
Many directories may lead to slow backups. Imagine doing recursive diffs on these directories.
Scalability. What happens when you run out of disk space and need to add more storage?
Your file names cannot contain spaces.
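The path computation for this scheme could be sketched as follows (Python; the function name `segmented_path` and the `levels` cap are assumptions for illustration):

```python
from pathlib import PureWindowsPath


def segmented_path(filename, segment_len=2, levels=3, root="d:\\storage"):
    """Split the name (without extension) into directory segments."""
    stem = filename.rsplit(".", 1)[0]
    segments = [
        stem[i:i + segment_len] for i in range(0, len(stem), segment_len)
    ]
    # Cap the depth so long names do not create very deep trees.
    return PureWindowsPath(root, *segments[:levels], filename)
```

`segmented_path("foobar.txt")` yields `d:\storage\fo\ob\ar\foobar.txt`, and `segmented_path("foobar.txt", segment_len=3)` yields `d:\storage\foo\bar\foobar.txt`, matching the two layouts shown above.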

This depends to a large extent on what file system you are going to store the files on. The capabilities of file systems in dealing with large number of files varies widely.
Your coworker is essentially suggesting the use of a Trie data structure. Using such a directory structure would mean that at each directory level there are only a handful of files/directories to choose from; this could help because as the number of files within a directory increases the time to access one of them does too (the actual time difference depends on the file system type.)
That said, I personally wouldn't go that many levels deep -- three to four levels ought to be enough to give the performance benefits -- most levels after that will probably have very few entries (assuming your file names don't follow any particular patterns.)
Also, I would store the file itself with its entire name, this will make it easier to traverse this directory structure manually also, if required.
So, I would store foobar.txt as f/o/o/b/foobar.txt

This highly depends on many factors:
What file system are you using?
How large is each file?
What type of drives are you using?
What are the access patterns?
Accessing files purely at random is really expensive on traditional disks. One significant improvement you can get is to use a solid state drive.
If you can reason about the access pattern, you might be able to leverage locality of reference when placing these files.
Another possible way is to use a database system, and store these files in the database to leverage the system's caching mechanism.
Update:
Given your update, is it possible for you to consolidate some files? 1 KB files are not very efficient to store, because file systems (FAT32, NTFS) have a cluster size, and each file will use a whole cluster even if it is smaller than the cluster size. There is usually a limit on the number of files in each folder, and performance degrades as you approach it. You can do a simple benchmark by putting as many as 10k files in a folder to see how much performance degrades.
If you are set on using the trie structure, I would suggest surveying the distribution of file names and then breaking them into different folders based on that distribution.
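The cluster-size point can be checked with a little arithmetic. The sketch below (Python) assumes a 4096-byte cluster and that every file occupies whole clusters; it ignores directory and metadata overhead and any file-system-specific handling of very small files.

```python
def allocated_bytes(file_size, cluster_size=4096):
    """Space a file occupies when it is rounded up to whole clusters."""
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size


FILES = 750_000_000
data_total = FILES * 1024                    # 1 KB of real data per file
disk_total = FILES * allocated_bytes(1024)   # what the clusters add up to
```

Under these assumptions 768 GB of data would occupy about 3.07 TB of clusters, a factor of four.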

First of all, the file size is very small. Any file system will use something like 4 times more space than the data itself: a 1 KB file will occupy 4 KB on disk. Especially on SSDs, 4 KB sectors will be the norm.
So you have to group several files into 1 physical file. 1024 files in 1 storage file seems reasonable. To locate the individual files in these storage files you have to use some RDBMS (PostgreSQL was mentioned and it is good, but SQLite may be better suited to this) or a similar structure to do the mapping.
The directory structure suggested by your friend sounds good but it does not solve the physical storage problem. You may use similar directory structure to store the storage files. It is better to name them by using a numerical system.
If you can, do not let them format the drive as FAT32; use at least NTFS or some recent Unix-flavored file system. As the total size of the files is not that big, NTFS may be sufficient, but ZFS is the better option...
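A rough sketch of the grouping idea from this answer, in Python with the built-in `sqlite3` module doing the mapping. The class name `PackedStore`, the table layout, and the numeric container naming are assumptions for illustration; deletion and space reclamation are not handled.

```python
import os
import sqlite3

FILES_PER_CONTAINER = 1024  # figure suggested in the answer


class PackedStore:
    """Append small files to numbered storage files; map names in SQLite."""

    def __init__(self, directory, db_path=":memory:"):
        self.directory = directory
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS entries (name TEXT PRIMARY KEY, "
            "container INTEGER, offset INTEGER, length INTEGER)"
        )
        self.count = self.db.execute(
            "SELECT COUNT(*) FROM entries"
        ).fetchone()[0]

    def _container_path(self, number):
        return os.path.join(self.directory, "%08d.bin" % number)

    def put(self, name, data):
        container = self.count // FILES_PER_CONTAINER
        with open(self._container_path(container), "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # current end of the container
            f.write(data)
        with self.db:
            self.db.execute(
                "INSERT INTO entries VALUES (?, ?, ?, ?)",
                (name, container, offset, len(data)),
            )
        self.count += 1

    def get(self, name):
        row = self.db.execute(
            "SELECT container, offset, length FROM entries WHERE name = ?",
            (name,),
        ).fetchone()
        if row is None:
            return None
        container, offset, length = row
        with open(self._container_path(container), "rb") as f:
            f.seek(offset)
            return f.read(length)
```

With 1024 files per container, 750 million small files would end up in roughly 730,000 storage files, which can themselves be spread over a directory tree.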

Is there any relation between individual files? As far as access times go, what folders you put things in won't affect much; the physical locations on the disk are what matter.

Why isn't storing the paths in a database table acceptable?

My guess is he is thinking of a Trie data structure to create on disk where the node is a directory.

I'd check out Hadoop's model.
P

I know this is a few years late, but maybe this can help the next guy..
My suggestion: use a SAN, mapped to a Z drive that other servers can map to as well. I wouldn't go with the folder path your friend suggested, but rather with drive:\clientid\year\month\day\, and if you ingest more than 100k docs a day, you can add subfolders for hour and even minute if needed. This way, you never have more than 60 subfolders at any level, even when going all the way down to seconds if required. Store the links in SQL for quick retrieval and reporting. This keeps the folder path pretty short, for example Z:\05\2004\02\26\09\55\filename.txt, so you don't run into any 256-character path limitations across the board.
Hope that helps someone. :)
