I am trying to store a simple large hashtable (64-bit key, 64-bit value), about 80 GB in size, on a hard drive. What is the most efficient way to do it if I want to get the best performance?
The keys to look up are totally random, and I have to do a lookup every 10 ms. Is there an abstraction available as a C/Linux library which can map/hash the key to the Logical Block Address of the HDD so that access will be faster?
Please give some guidelines.
You might use a memory-mapped file (mmap), and then arrange your data in such a way that you would only read one page for each lookup. This could be done by having all keys sorted in the file, and then keeping an in-memory index that holds the first key of each page.
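A minimal sketch of that layout, assuming the file holds tightly packed, sorted (64-bit key, 64-bit value) pairs; "table.bin" is a hypothetical file name and error checking is omitted:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct pair { uint64_t key, value; };

    #define PAGE 4096
    #define PAIRS_PER_PAGE (PAGE / sizeof(struct pair))

    static struct pair *table;       /* mmap'ed file contents */
    static size_t npairs, npages;
    static uint64_t *first_keys;     /* first key of each page, kept in RAM */

    static void build_page_index(void)
    {
        first_keys = malloc(npages * sizeof *first_keys);
        for (size_t p = 0; p < npages; p++)
            first_keys[p] = table[p * PAIRS_PER_PAGE].key;
    }

    static int lookup(uint64_t key, uint64_t *value_out)
    {
        /* binary search over the in-memory page index */
        size_t lo = 0, hi = npages;
        while (hi - lo > 1) {
            size_t mid = (lo + hi) / 2;
            if (first_keys[mid] <= key) lo = mid; else hi = mid;
        }
        /* binary search within the single page we now have to touch */
        size_t start = lo * PAIRS_PER_PAGE, end = start + PAIRS_PER_PAGE;
        if (end > npairs) end = npairs;
        while (start < end) {
            size_t mid = (start + end) / 2;
            if (table[mid].key == key) { *value_out = table[mid].value; return 1; }
            if (table[mid].key < key) start = mid + 1; else end = mid;
        }
        return 0;
    }

    int main(void)
    {
        int fd = open("table.bin", O_RDONLY);   /* error checking omitted */
        struct stat st;
        fstat(fd, &st);
        npairs = st.st_size / sizeof(struct pair);
        npages = (st.st_size + PAGE - 1) / PAGE;
        table = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        build_page_index();

        uint64_t v;
        if (lookup(0x5a5bf28dcd794499ULL, &v))
            printf("value = %llu\n", (unsigned long long)v);
        return 0;
    }

With an 80 GB file there are roughly 20 million 4 KiB pages, so the in-memory first-key index is only about 160 MB, and each lookup reads a single page of the mapping.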
Rely on the file system to do the work, and use the hash to form a file system path & file name. E.g., at 64 bits, assume your key, in 16 hex characters, is
5a5bf28dcd794499
Store that hash's value in file \5a\5b\f2\8d\cd\79\44\99.txt
This scheme limits each subdirectory to at most 256 folders/files. Git does this, but only goes one directory deep, probably assuming (reasonably) that you won't commit billions of files to your git store.
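A small sketch of building such a path from a 64-bit key, assuming a hypothetical base directory "store" and forward slashes on Linux (the backslash form above is the same idea on Windows):

    #include <stdint.h>
    #include <stdio.h>

    /* turn a 64-bit key into a nested path like "store/5a/5b/f2/8d/cd/79/44/99.txt" */
    static void key_to_path(uint64_t key, char *buf, size_t buflen)
    {
        unsigned char b[8];
        for (int i = 0; i < 8; i++)
            b[i] = (key >> (56 - 8 * i)) & 0xff;    /* most significant byte first */
        snprintf(buf, buflen,
                 "store/%02x/%02x/%02x/%02x/%02x/%02x/%02x/%02x.txt",
                 b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]);
    }

    int main(void)
    {
        char path[64];
        key_to_path(0x5a5bf28dcd794499ULL, path, sizeof path);
        puts(path);    /* prints: store/5a/5b/f2/8d/cd/79/44/99.txt */
        return 0;
    }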
Some general guidelines:
Use open addressing with linear probing to resolve collisions. With a low load factor this should result in a single HDD access per lookup in most cases (see the sketch after these guidelines).
On a 64-bit system, try mmapping the file for better cache performance.
It might help to create a separate partition and access it directly via /dev/sd??.
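A sketch of the open-addressing lookup over an mmapped table file, assuming the table was built with the same hash and probe order, and that key 0 is reserved to mean "empty"; the file name, the hash mix, and the lack of error handling are simplifications:

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct slot { uint64_t key, value; };

    static struct slot *slots;      /* mmap'ed table file */
    static uint64_t nslots;         /* number of slots in the file */

    int ht_open(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;
        struct stat st;
        fstat(fd, &st);
        nslots = st.st_size / sizeof(struct slot);
        slots = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        return slots == MAP_FAILED ? -1 : 0;
    }

    int ht_lookup(uint64_t key, uint64_t *value_out)
    {
        /* keys are already random, so a simple mix is enough; the builder
           must use the exact same hash */
        uint64_t i = (key * 0x9e3779b97f4a7c15ULL) % nslots;
        for (;;) {                               /* assumes the table is never full */
            if (slots[i].key == key) { *value_out = slots[i].value; return 1; }
            if (slots[i].key == 0)   return 0;   /* hit an empty slot: key absent */
            i = (i + 1) % nslots;                /* linear probe */
        }
    }

Keeping the load factor around 50% (i.e. making the table file roughly twice the size of the data) keeps probe sequences short, so a lookup rarely crosses a page boundary and therefore rarely costs more than one disk read.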
I am working on a project where I am using words encoded as vectors, which are about 2000 floats long. When I process raw text I need to retrieve the vector for each word as it comes up and do some computations with it. Needless to say, for a large vocabulary (~100k words) this has a large storage requirement (about 8 GB in a text file).
I initially had a system where I split the large text file into smaller ones and then for a particular word, I read its file, and retrieved its vector. This was too slow as you might imagine.
I next tried reading everything into RAM (it takes about 40 GB of RAM), figuring that once everything was read in, it would be quite fast. However, it takes a long time to read in, and a disadvantage is that I can only use certain machines which have enough free RAM to do this. Once the data is loaded, though, it is much faster than the other approach.
I was wondering how a database would compare with these approaches. Retrieval would be slower than the RAM approach, but there wouldn't be the memory overhead. Also, any other ideas would be welcome; I have had others myself (e.g. caching, using a server that has everything loaded into RAM, etc.). I might benchmark a database, but I thought I would post here to see what others had to say.
Thanks!
UPDATE
I used Tyler's suggestion, although in my case I did not think a B-tree was necessary. I just hashed the words to their file offsets. I could then look up a word and read in its vector at runtime. I cached the words as they occurred in the text, so each vector is read in at most once; this saves the overhead of reading in and storing unneeded words, making it superior to the RAM approach.
Just an FYI, I used Java's RandomAccessFile class and made use of its readLine(), getFilePointer(), and seek() methods.
Thanks to all who contributed to this thread.
UPDATE 2
For a further performance improvement, check out the buffered RandomAccessFile from:
http://minddumped.blogspot.com/2009/01/buffered-javaiorandomaccessfile.html
Apparently readLine() from RandomAccessFile is very slow because it reads byte by byte. The buffered version gave me some nice improvement.
As a rule, anything custom coded should be much faster than a generic database, assuming you have coded it efficiently.
There are specific C libraries that solve this problem using B-trees. In the old days there was a famous library called "Btrieve" that was very popular because it was fast. In this application a B-tree will be faster and easier than fooling around with a database.
If you want optimal performance you would use a data structure called a suffix tree. There are libraries which are designed to create and use suffix trees. This will give you the fastest word lookup possible.
In either case there is no reason to store the entire dataset in memory; just keep the B-tree (or suffix tree), with an offset to the data, in memory. This will require about 3 to 5 megabytes of memory. When you query the tree you get an offset back. Then open the file, seek forward to the offset and read the vector off disk.
You could use a simple text-based index file that just maps the words to indices, and another file containing just the raw vector data for each word. Initially you read the index into a hash map that maps each word to its data-file index and keep it in memory. If you need the data for a word, you calculate the offset in the data file (index * 2000 * 4 bytes, assuming 32-bit floats) and read it as needed. You probably want to cache this data in RAM (if you are in Java, perhaps just use a weak map as a starting point).
This is basically implementing your own primitive database, but it may still be preferable because it avoids database setup/deployment complexity.
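A rough sketch of the offset arithmetic, in C for brevity (the same idea works with seek() on Java's RandomAccessFile); "vectors.bin", DIM, and the example index 42 are assumptions for illustration:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define DIM 2000    /* floats per word vector */

    /* "vectors.bin" (hypothetical) holds DIM 32-bit floats per word, packed
       back to back; the word-to-index map is assumed to live elsewhere. */
    int read_vector(int fd, uint64_t index, float *out)
    {
        off_t offset = (off_t)index * DIM * sizeof(float);   /* bytes, not bits */
        ssize_t n = pread(fd, out, DIM * sizeof(float), offset);
        return n == (ssize_t)(DIM * sizeof(float)) ? 0 : -1;
    }

    int main(void)
    {
        int fd = open("vectors.bin", O_RDONLY);
        float vec[DIM];
        if (fd >= 0 && read_vector(fd, 42, vec) == 0)   /* 42: index of some word */
            printf("first component: %f\n", vec[0]);
        return 0;
    }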
What is a good strategy for mass storage of millions of small files (~50 KB on average) with auto-pruning of files older than 20 minutes? I need to write and access them from the web server.
I am currently using ext4, and during deletes (scheduled in cron) HDD usage spikes up to 100%, with [flush-8:0] showing up as the process that creates the load. This load interferes with other applications on the server. When there are no deletes, maximum HDD utilisation is 0-5%. The situation is the same with nested and non-nested directory structures. The worst part is that mass removal during peak load seems to be slower than the rate of insertions, so the number of files that need to be removed grows larger and larger.
I have tried changing schedulers (deadline, cfq, noop); it didn't help. I have also tried setting ionice on the removal script, but it didn't help either.
I have tried GridFS with MongoDB 2.4.3 and it performs nicely, but horribly during mass deletes of old files. I have tried running MongoDB with journaling turned off (nojournal) and without write confirmation for both deletes and inserts (w=0), and it didn't help. It only works fast and smooth when there are no deletes going on.
I have also tried storing the data in MySQL 5.5, in a BLOB column, in an InnoDB table, with the InnoDB engine set to use innodb_buffer_pool_size=2GB, innodb_log_file_size=1GB, innodb_flush_log_at_trx_commit=2, but the performance was worse; HDD load was always at 80%-100% (expected, but I had to try). The table only had the BLOB column, a DATETIME column, and a CHAR(32) latin1_bin UUID column, with indexes on the UUID and DATETIME columns, so there was no room for optimization, and all queries were using the indexes.
I have looked into the pdflush settings (the Linux flush process that creates the load during mass removal), but changing the values didn't help, so I reverted to the defaults.
It doesn't matter how often I run the auto-pruning script (every 1 second, every minute, every 5 minutes, every 30 minutes); it disrupts the server significantly either way.
I have tried storing the inode numbers and, when removing, deleting old files sequentially after sorting them by inode number, but it didn't help.
Using CentOS 6. The drive is an SSD RAID 1 array.
What would be a good and sensible solution for my task that solves the auto-pruning performance problem?
Deletions are kind of a performance nuisance because both the data and the metadata need to get destroyed on disk.
Do they really need to be separate files? Do the old files really need to get deleted, or is it OK if they get overwritten?
If the answer is "no" to the second of these questions, try this:
Keep a list of files that's roughly sorted by age. Maybe chunk it by file size.
When you want to write to a new file, find an old file that's preferably bigger than what you're replacing it with. Instead of blowing away the old file, truncate() it to the appropriate length and then overwrite its contents (see the sketch after this list). Make sure you update your old-files list.
Clean up the really old stuff that hasn't been replaced explicitly once in a while.
It might be advantageous to have an index into these files. Try using a tmpfs full of symbolic links to the real file system.
You might or might not get a performance advantage in this scheme by chunking the files into manageably-sized subdirectories.
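Here is a minimal sketch of the truncate-and-overwrite step, with error handling kept to a minimum; the path is assumed to come from your old-files list:

    #include <fcntl.h>
    #include <stddef.h>
    #include <unistd.h>

    /* shrink an old file to the new payload's size and rewrite it in place,
       instead of unlinking it and creating a fresh file */
    int reuse_file(const char *old_path, const void *data, size_t len)
    {
        int fd = open(old_path, O_WRONLY);
        if (fd < 0) return -1;
        if (ftruncate(fd, (off_t)len) < 0) { close(fd); return -1; }
        ssize_t n = pwrite(fd, data, len, 0);   /* overwrite from offset 0 */
        close(fd);
        return n == (ssize_t)len ? 0 : -1;
    }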
If you're OK with multiple things being in the same file:
Keep files of similar sizes together by storing each one as an offset into an array of similarly-sized files. If every file is 32k or 64k, keep a file full of 32k chunks and a file full of 64k chunks. If files are of arbitrary sizes, round up to the next power of two.
You can do lazy deletes here by keeping track of how stale each file is. If you're trying to write and something's stale, overwrite it instead of appending to the end of the file.
Another thought: Do you get a performance advantage by truncate()ing all of the files to length 0 in inode order and then unlink()ing them? Ignorance stops me from knowing whether this can actually help, but it seems like it would keep the data zeroing together and the metadata writing similarly together.
Yet another thought: XFS has a weaker write ordering model than ext4 with data=ordered. Is it fast enough on XFS?
If mass-removing millions of files results in a performance problem, you can resolve it by "removing" all files at once. Instead of using any filesystem operation (like "remove" or "truncate"), you could just create a new (empty) filesystem in place of the old one.
To implement this idea you need to split your drive into two (or more) partitions. After one partition is full (or after 20 minutes), you start writing to the second partition while using the first one for reading only. After another 20 minutes, you unmount the first partition, create an empty filesystem on it, mount it again, and then start writing to the first partition while using the second one for reading only.
The simplest solution is to use just two partitions. But this way you don't use disk space very efficiently: you can store only half as many files on the same drive. With more partitions you can increase space efficiency.
If for some reason you need all your files in one place, use tmpfs to store links to the files on each partition. This requires mass-removing millions of links from tmpfs, but it alleviates the performance problem because only the links have to be removed, not the file contents; also, these links are removed only from RAM, not from the SSD.
If you don't need to append to the small files, I would suggest that you create a big file and do a sequential write of the small files right in it, while keeping records of offsets and sizes of all the small files within that big file.
As you reach the end of the big file, start writing from its beginning again, while invalidating records of the small files in the beginning that you replace with new data.
If you choose the big file's size properly, based on the rate at which new files are saved, you can get auto-pruning of files older than ~20 minutes almost exactly as you need.
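A very small sketch of that circular big-file idea; BIGFILE_SIZE, the record layout, and the lack of locking or persistence for the record table are all assumptions you would need to flesh out:

    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>

    /* size chosen so the file covers roughly 20 minutes of writes */
    #define BIGFILE_SIZE (64ULL * 1024 * 1024 * 1024)

    struct record { uint64_t offset, size; time_t written; };

    static uint64_t write_pos;    /* next write offset within the big file */

    /* append one small file's bytes; the caller keeps the returned record
       in an in-memory index keyed by name (small blobs only, no locking) */
    struct record store_blob(int fd, const void *data, uint64_t size)
    {
        if (write_pos + size > BIGFILE_SIZE)
            write_pos = 0;                       /* wrap: old data gets overwritten */
        struct record r = { write_pos, size, time(NULL) };
        pwrite(fd, data, size, (off_t)write_pos);    /* error handling omitted */
        write_pos += size;
        return r;
    }

Readers then use the recorded (offset, size) pair with pread(), and treat any record older than 20 minutes, or one whose region has since been overwritten, as pruned.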
I heard that the NTFS file system is basically a b-tree. Is that true? What about the other file systems? What kind of trees are they?
Also, how is FAT32 different from FAT16?
What kind of tree are the FAT file systems using?
The FAT file systems (FAT12, FAT16, and FAT32) do not use a tree of any kind. Two interesting data structures are used, in addition to a block of data describing the partition itself. Full details at the level required to write a compatible implementation in an embedded system are available from Microsoft and third parties. Wikipedia has a decent article as an alternative starting point that also includes a lot of the history of how it got the way it is.
Since the original question was about the use of trees, I'll provide a quick summary of what little data structure is actually in a FAT file system. Refer to the above references for accurate details and for history.
The set of files in each directory is stored in a simple list, initially in the order the files were created. Deletion is done by marking an entry as deleted, so a subsequent file creation might re-use that slot. Each entry in the list is a fixed-size struct, just large enough to hold the classic 8.3 file name along with the flag bits, size, dates, and the starting cluster number. Long file names (which also include international character support) are handled by using extra directory entry slots to hold the long name alongside the original 8.3 slot that holds all the rest of the file attributes.
Each file on the disk is stored in a sequence of clusters, where each cluster is a fixed number of adjacent disk blocks. Each directory (except the root directory of a disk) is just like a file, and can grow as needed by allocating additional clusters.
Clusters are managed by the (misnamed) File Allocation Table from which the file system gets its common name. This table is a packed array of slots, one for each cluster in the disk partition. The name FAT12 implies that each slot is 12 bits wide, FAT16 slots are 16 bits, and FAT32 slots are 32 bits. The slot stores code values for empty, last, and bad clusters, or the cluster number of the next cluster of the file. In this way, the actual content of a file is represented as a linked list of clusters called a chain.
Larger disks require wider FAT entries and/or larger allocation units. FAT12 is essentially only found on floppy disks where its upper bound of 4K clusters makes sense for media that was never much more than 1MB in size. FAT16 and FAT32 are both commonly found on thumb drives and flash cards. The choice of FAT size there depends partly on the intended application.
Access to the content of a particular file is straightforward. From its directory entry you learn its total size in bytes and its first cluster number. From the cluster number, you can immediately calculate the address of the first logical disk block. From the FAT indexed by cluster number, you find each allocated cluster in the chain assigned to that file.
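As a toy illustration, following a FAT16 chain is just repeated array indexing once the FAT is in memory; cluster-to-LBA conversion and error handling are left out, and the visit() callback is a placeholder:

    #include <stdint.h>

    /* `fat` is the File Allocation Table already read into memory;
       `first_cluster` comes from the file's directory entry */
    void walk_chain(const uint16_t *fat, uint16_t first_cluster,
                    void (*visit)(uint16_t cluster))
    {
        uint16_t c = first_cluster;
        /* clusters 0 and 1 are reserved, 0xFFF7 marks a bad cluster, and
           0xFFF8-0xFFFF mark end of chain, so stop on any of those */
        while (c >= 2 && c < 0xFFF7) {
            visit(c);        /* e.g. convert c to an LBA and read its blocks */
            c = fat[c];      /* the next cluster number comes from the FAT slot */
        }
    }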
Discovery of free space suitable for storage of a new file or extending an existing file is not as easy. The FAT file system simply marks free clusters with a code value. Finding one or more free clusters requires searching the FAT.
Locating the directory entry for a file is not fast either since the directories are not ordered, requiring a linear time search through the directory for the desired file. Note that long file names increase the search time by occupying multiple directory entries for each file with a long name.
FAT still has the advantage that it is simple enough to implement that it can be done in small microprocessors so that data interchange between even small embedded systems and PCs can be done in a cost effective way. I suspect that its quirks and oddities will be with us for a long time as a result.
ext3 and ext4 use "H-trees", which are apparently a specialized form of B-tree.
BTRFS uses B-trees (B-Tree File System).
ReiserFS uses B+trees, which are apparently what NTFS uses.
By the way, if you search for these on Wikipedia, it's all listed in the info box on the right side under "Directory contents".
Here is a nice chart on FAT16 vs FAT32.
The numerals in the names FAT16 and FAT32 refer to the number of bits required for a file allocation table entry. FAT16 uses a 16-bit file allocation table entry (2^16 allocation units). Windows 2000 reserves the first 4 bits of a FAT32 file allocation table entry, which means FAT32 has a maximum of 2^28 allocation units. However, this number is capped at 32 GB by the Windows 2000 format utilities.
http://technet.microsoft.com/en-us/library/cc940351.aspx
FAT32 uses 32-bit numbers to store cluster numbers. It supports larger disks and files up to 4 GiB in size.
As far as I understand the topic, FAT uses a File Allocation Table, which stores data about the allocation status of clusters on disk. It appears that it doesn't use trees. I could be wrong though.
I have a text file which contains some info on extents for all the files in the file system, like below:
C:\Program Files\abcd.txt
12345 100
23456 200
C:\Program Files\bcde.txt
56789 50
26746 300
...
Now I have another binary which tries to find out the extents for all the files.
Currently I am using a linear search to find the extent info for the files in the above-mentioned text file. This is a time-consuming process. Is there a better way of coding this, for example by implementing a good data structure like a B-tree? If a B+ tree is used, what key and branching factor should I use?
Use a database.
The key points in implementing a tree in a file are to have fixed record lengths and to use file offsets instead of pointers.
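For example, here is a sketch of such a fixed-length node record, where child links are node numbers rather than pointers and a node's byte offset is simply its number times the record size (ORDER and the field layout are arbitrary choices for illustration, not a full B-tree implementation):

    #include <stdint.h>
    #include <unistd.h>

    #define ORDER 16    /* example branching factor */

    struct node {
        uint32_t nkeys;
        uint64_t keys[ORDER - 1];
        uint64_t children[ORDER];    /* node numbers in the file, 0 = none */
    };

    /* every node occupies sizeof(struct node) bytes, so node n lives at a
       fixed offset and can be fetched with a single read */
    int read_node(int fd, uint64_t node_no, struct node *out)
    {
        off_t offset = (off_t)node_no * sizeof(struct node);
        return pread(fd, out, sizeof *out, offset) == (ssize_t)sizeof *out ? 0 : -1;
    }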
Use a database. Hmmm, SQLite.
Another point to consider with files is that reading chunks of data is faster than reading individual items (regardless of whether or not the hard disk has a cache or the OS has a cache). I implemented a B+ tree which uses pages as its nodes.
Use a database. Databases have already been written and tested.
A more efficient design is to keep the initial node in memory. This reduces the number of fetches from the file. If your program has the space, keeping the first couple of levels in memory may also speed up execution.
Use a database.
I gave up writing a B-tree implementation for my application because I wanted to concentrate on the other functionality of the program. I later learned that in the real world (the world where programs need to be finished on a schedule) time should be spent on the 'core' of the application rather than accessories that have already been written and tested (a.k.a. off-the-shelf).
It depends on how you want to search your file. I assume that you want to look up your info given a file name. Then a hash table or a Trie would be a good data structure to use.
The B-tree is possible but not the most convenient choice given that your keys are strings.
I have about 750,000,000 files I need to store on disk. What's more, I need to be able to access these files randomly--any given file at any time--in the shortest time possible. What do I need to do to make accessing these files fastest?
Think of it like a hash table, only the hash keys are the filenames and the associated values are the files' data.
A coworker said to organize them into directories like this: if I want to store a file named "foobar.txt" and it's stored on the D: drive, put the file in "D:\f\o\o\b\a\r.\t\x\t". He couldn't explain why this was a good idea though. Is there anything to this idea?
Any ideas?
The crux of this is finding a file. What's the fastest way to find a file by name to open?
EDIT:
I have no control over the file system upon which this data is stored. It's going to be NTFS or FAT32.
Storing the file data in a database is not an option.
Files are going to be very small--maximum of probably 1 kb.
The drives are going to be solid state.
Data access is virtually random, but I could probably figure out a priority for each file based on how often it is requested. Some files will be accessed much more than others.
Items will constantly be added, and sometimes deleted.
It would be impractical to consolidate multiple files into single files because there's no logical association between files.
I would love to gather some metrics by running tests on this stuff, but that endeavour could become as consuming as the project itself!
EDIT2:
I want to upvote several thorough answers, whether they're spot-on or not, and cannot because of my newbie status. Sorry guys!
This sounds like it's going to be largely a question of filesystem choice. One option to look at might be ZFS, it's designed for high volume applications.
You may also want to consider using a relational database for this sort of thing. 750 million rows is sort of a medium size database, so any robust DBMS (eg. PostgreSQL) would be able to handle it well. You can store arbitrary blobs in the database too, so whatever you were going to store in the files on disk you can just store in the database itself.
Update: Your additional information is certainly helpful. Given a choice between FAT32 and NTFS, definitely choose NTFS. Don't store too many files in a single directory; 100,000 might be an upper limit to consider (although you will have to experiment, there's no hard and fast rule). Your friend's suggestion of a new directory for every letter is probably too much; you might consider breaking it up on every four letters or something. The best value to choose depends on the shape of your dataset.
The reason breaking up the name is a good idea is that typically the performance of filesystems decreases as the number of files in a directory increases. This depends highly on the filesystem in use, for example FAT32 will be horrible with probably only a few thousand files per directory. You don't want to break up the filenames too much, so you will minimise the number of directory lookups the filesystem will have to do.
That file algorithm will work, but it's not optimal. I would think that using 2 or 3 character "segments" would be better for performance - especially when you start considering doing backups.
For example:
d:\storage\fo\ob\ar\foobar.txt
or
d:\storage\foo\bar\foobar.txt
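A hedged sketch of building such segmented paths (two levels of three characters, matching the second example; names shorter than six characters would need extra handling):

    #include <stdio.h>

    /* "foobar.txt" -> "d:\storage\foo\bar\foobar.txt" */
    void build_path(const char *name, char *buf, size_t buflen)
    {
        const int seg = 3, depth = 2;    /* two levels of three characters */
        int pos = snprintf(buf, buflen, "d:\\storage");
        for (int i = 0; i < depth; i++)
            pos += snprintf(buf + pos, buflen - pos, "\\%.*s", seg, name + i * seg);
        snprintf(buf + pos, buflen - pos, "\\%s", name);
    }

    int main(void)
    {
        char path[256];
        build_path("foobar.txt", path, sizeof path);
        puts(path);    /* prints: d:\storage\foo\bar\foobar.txt */
        return 0;
    }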
There are some benefits to using this sort of algorithm:
No database access is necessary.
Files will be spread out across many directories. If you don't spread them out, you'll hit severe performance problems. (I vaguely recall hearing about someone having issues at ~40,000 files in a single folder, but I'm not confident in that number.)
There's no need to search for a file. You can figure out exactly where a file will be from the file name.
Simplicity. You can very easily port this algorithm to just about any language.
There are some down-sides to this too:
Many directories may lead to slow backups. Imagine doing recursive diffs on these directories.
Scalability. What happens when you run out of disk space and need to add more storage?
Your file names cannot contain spaces.
This depends to a large extent on what file system you are going to store the files on. The capabilities of file systems in dealing with large number of files varies widely.
Your coworker is essentially suggesting the use of a Trie data structure. Using such a directory structure would mean that at each directory level there are only a handful of files/directories to choose from; this could help because as the number of files within a directory increases the time to access one of them does too (the actual time difference depends on the file system type.)
That said, I personally wouldn't go that many levels deep -- three to four levels ought to be enough to give the performance benefits -- most levels after that will probably have very few entries (assuming your file names don't follow any particular patterns.)
Also, I would store the file itself with its entire name; this will make it easier to traverse this directory structure manually, if required.
So, I would store foobar.txt as f/o/o/b/foobar.txt
This highly depends on many factors:
What file system are you using?
How large is each file?
What type of drives are you using?
What are the access patterns?
Accessing files purely at random is really expensive on traditional disks. One significant improvement you can get is to use a solid state drive.
If you can reason an access pattern, you might be able to leverage locality of reference to place these files.
Another possible way is to use a database system, and store these files in the database to leverage the system's caching mechanism.
Update:
Given your update, is it possible for you to consolidate some files? 1 KB files are not very efficient to store, as file systems (FAT32, NTFS) have a cluster size and each file will consume at least one cluster even if it is smaller than the cluster size. There is usually also a limit on the number of files in each folder, with performance concerns. You can do a simple benchmark by putting as many as 10k files in a folder to see how much performance degrades.
If you are set on using the trie structure, I would suggest surveying the distribution of file names and then breaking them into different folders based on the distribution.
First of all, the file size is very small. Any file system will eat something like at least 4 times more space; I mean a 1 KB file will occupy 4 KB on disk. Especially on SSDs, a 4 KB allocation unit will be the norm.
So you have to group several files into one physical file. 1024 files in one storage file seems reasonable. To locate the individual files in these storage files you have to use some RDBMS (PostgreSQL was mentioned and it is good, but SQLite may be better suited to this) or a similar structure to do the mapping.
The directory structure suggested by your friend sounds good, but it does not solve the physical storage problem. You may use a similar directory structure to store the storage files. It is better to name them using a numbering scheme.
If you can, do not let the drives be formatted as FAT32; use at least NTFS or some recent file system of Unix flavor. As the total size of the files is not that big, NTFS may be sufficient, but ZFS is the better option...
Is there any relation between individual files? As far as access times go, what folders you put things in won't affect much; the physical locations on the disk are what matter.
Why isn't storing the paths in a database table acceptable?
My guess is he is thinking of a Trie data structure created on disk, where each node is a directory.
I'd check out Hadoop's model.
I know this is a few years late, but maybe this can help the next guy..
My suggestion: use a SAN, mapped to a Z: drive that other servers can map to as well. I wouldn't go with the folder path your friend suggested, but rather with drive:\clientid\year\month\day\, and if you ingest more than 100k docs a day, you can add subfolders for hour and even minute if needed. This way, you never have more than 60 subfolders while going all the way down to seconds if required. Store the links in SQL for quick retrieval and reporting. This makes the folder path pretty short, for example: Z:\05\2004\02\26\09\55\filename.txt, so you don't run into any 256-character path limitations across the board.
Hope that helps someone. :)