This question recurs frequently on StackOverflow, but I have read all the previous relevant answers, and have a slight twist on the question.
I have a 23Gb file containing 475 million lines of equal size, with each line consisting of a 40-character hash code followed by an identifier (an integer).
I have a stream of incoming hash codes - billions of them in total - and for each incoming hash code I need to locate it and print out the corresponding identifier. This job, while large, only needs to be done once.
The file is too large for me to read into memory, so I have been trying to use mmap in the following way:
codes = (char *) mmap(0,statbuf.st_size,PROT_READ,MAP_SHARED,codefile,0);
Then I just do a binary search using address arithmetic based on the address in codes.
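A simplified sketch of what I am doing (LINE_LEN, the file name, and the sample hash below are placeholders; the real code derives the record size from the file):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define HASH_LEN 40
#define LINE_LEN 52   /* hypothetical fixed record size; derive from the file in practice */

/* Binary search over the mapped records; prints the identifier field if found. */
static void lookup(const char *codes, long nrecords, const char *target) {
    long lo = 0, hi = nrecords - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        const char *rec = codes + mid * (long)LINE_LEN;              /* address arithmetic */
        int c = memcmp(target, rec, HASH_LEN);
        if (c == 0) {
            printf("%.*s\n", LINE_LEN - HASH_LEN - 1, rec + HASH_LEN);   /* the identifier */
            return;
        }
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    printf("not found\n");
}

int main(void) {
    int codefile = open("codes.txt", O_RDONLY);
    struct stat statbuf;
    if (codefile < 0 || fstat(codefile, &statbuf) < 0) return 1;

    const char *codes = mmap(0, statbuf.st_size, PROT_READ, MAP_SHARED, codefile, 0);
    if (codes == MAP_FAILED) return 1;

    long nrecords = statbuf.st_size / LINE_LEN;
    lookup(codes, nrecords, "0123456789012345678901234567890123456789");   /* dummy hash */
    return 0;
}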
This seems to start working beautifully and produces a few million identifiers in a few seconds, using 100% of the cpu, but then after some, seemingly random, amount of time it slows down to a crawl. When I look at the process using ps, it has changed from status "R" using 100% of the cpu, to status "D" (diskbound) using 1% of the cpu.
This is not repeatable - I can start the process off again on the same data, and it might run for 5 seconds or 10 seconds before the "slow to crawl" happens. Once last night, I got nearly a minute out of it before this happened.
Everything is read only, I am not attempting any writes to the file, and I have stopped all other processes (that I control) on the machine. It is a modern Red Hat Enterprise Linux 64-bit machine.
Does anyone know why the process becomes disk-bound and how to stop it?
UPDATE:
Thanks to everyone for answering, and for your ideas; I had not previously tried all the various improvements because I was wondering if I was somehow using mmap incorrectly. But the gist of the answers seemed to be that unless I could squeeze everything into memory, I would inevitably run into problems. So I squashed the hash code down to the leading prefix that did not create any duplicates - the first 15 characters were enough. Then I pulled the resulting file into memory, and ran the incoming hash codes in batches of about 2 billion each.
The first thing to do is split the file.
Make one file with the hash codes and another with the integer ids. Since the rows are the same length, the two files stay lined up, so once you find a hash's position you can read the id at the same position. You can also try an approach that puts every nth hash into another file along with its index.
For example, put every 1000th hash key into a new file together with its line number, and load that into memory. Then binary search that instead. It will tell you which range of 1000 entries needs to be scanned further in the big file. You probably don't even need to sample that sparsely - keeping every 20th record would still shrink the index to roughly a 20th of the file size.
In other words, after scanning the index you only need to touch a few kilobytes of the big file on disk.
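A minimal sketch of the index-building pass, under some assumptions (the file names, the sampling interval K, and a generous line buffer are all illustrative):

#include <stdio.h>

#define K 1000   /* keep every 1000th entry; tune as discussed above */

int main(void) {
    FILE *in  = fopen("codes.txt", "r");   /* the big sorted file of hash + id lines */
    FILE *idx = fopen("codes.idx", "w");   /* the small index: every Kth hash + its line number */
    if (!in || !idx) return 1;

    char line[256];
    long lineno = 0;
    while (fgets(line, sizeof line, in)) {
        if (lineno % K == 0)
            fprintf(idx, "%.40s %ld\n", line, lineno);   /* 40-char hash prefix + line number */
        lineno++;
    }

    fclose(idx);
    fclose(in);
    return 0;
}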
Another option is to split the file and put it in memory on multiple machines. Then just binary scan each file. This will yield the absolute fastest possible search with zero disk access...
Have you considered hacking a PATRICIA trie algorithm up? It seems to me that if you can build a PATRICIA tree representation of your data file, which refers to the file for the hash and integer values, then you might be able to reduce each item to node pointers (2*64 bits?), bit test offsets (1 byte in this scenario) and file offsets (uint64_t, which might need to correspond to multiple fseek()s).
Does anyone know why the process becomes disk-bound and how to stop it?
Binary search requires a lot of seeking within the file. In the case where the whole file doesn't fit in memory, the page cache doesn't handle the big seeks very well, resulting in the behaviour you're seeing.
The best way to deal with this is to reduce/prevent the big seeks and make the page cache work for you.
Three ideas for you:
If you can sort the input stream, you can search the file in chunks, using something like the following algorithm:
code_block <- mmap the first N entries of the file, where N entries fit in memory
max_code   <- code_block[N - 1]
while (input codes remain) {
    input_code <- next input code
    while (input_code > max_code) {
        code_block <- mmap the next N entries of the file
        max_code   <- code_block[N - 1]
    }
    binary search for input_code in code_block
}
If you can't sort the input stream, you could reduce your disk seeks by building an in-memory index of the data. Pass over the large file, and make a table that is:
record_hash, offset into file where this record starts
Don't store all records in this table - store only every Kth record. Pick a large K, but small enough that this fits in memory.
To search the large file for a given target hash, do a binary search in the in-memory table to find the biggest hash in the table that is smaller than the target hash. Say this is table[h]. Then, mmap the segment starting at table[h].offset and ending at table[h+1].offset, and do a final binary search. This will dramatically reduce the number of disk seeks.
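A hedged sketch of that two-level lookup in C, assuming a fixed record length, an index array built in one prior pass (not shown), and illustrative names throughout:

#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define HASH_LEN 40
#define LINE_LEN 52                                   /* hypothetical fixed record size */

struct index_entry { char hash[HASH_LEN]; off_t offset; };

/* Binary search the in-memory index for the last entry whose hash is <= target. */
static long index_floor(const struct index_entry *idx, long n, const char *target) {
    long lo = 0, hi = n - 1, best = 0;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        if (memcmp(idx[mid].hash, target, HASH_LEN) <= 0) { best = mid; lo = mid + 1; }
        else hi = mid - 1;
    }
    return best;
}

/* Map only the segment [idx[h].offset, idx[h+1].offset) and search it.
   Returns the byte offset of the matching record, or -1. */
off_t find(int fd, const struct index_entry *idx, long n, off_t file_size, const char *target) {
    long h = index_floor(idx, n, target);
    off_t start = idx[h].offset;
    off_t end = (h + 1 < n) ? idx[h + 1].offset : file_size;

    off_t page = sysconf(_SC_PAGESIZE);               /* mmap offsets must be page aligned */
    off_t aligned = start - (start % page);
    char *seg = mmap(0, end - aligned, PROT_READ, MAP_SHARED, fd, aligned);
    if (seg == MAP_FAILED) return -1;

    const char *base = seg + (start - aligned);
    long lo = 0, hi = (end - start) / LINE_LEN - 1;
    off_t result = -1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        int c = memcmp(target, base + mid * (long)LINE_LEN, HASH_LEN);
        if (c == 0) { result = start + mid * (off_t)LINE_LEN; break; }
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    munmap(seg, end - aligned);
    return result;
}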
If this isn't enough, you can have multiple layers of indexes:
record_hash, offset into index where the next index starts
Of course, you'll need to know ahead of time how many layers of index there are.
Lastly, if you have extra money available you can always buy more than 23 GB of RAM and make this a memory-bound problem again (I just looked at Dell's website - you can pick up a new low-end workstation with 32 GB of RAM for just under $1,400 Australian dollars). Of course, it will take a while to read that much data in from disk, but once it's there, you'll be set.
Instead of using mmap, consider just using plain old lseek+read. You can define some helper functions to read a hash value or its corresponding integer:
/* fd is the open file descriptor and line_len the fixed length of one record
   (the 40-character hash plus the identifier); both are assumed to be globals. */
void read_hash(int line, char *hashbuf) {
    lseek64(fd, ((uint64_t)line) * line_len, SEEK_SET);
    read(fd, hashbuf, 40);                       /* the 40-character hash */
}

int read_int(int line) {
    lseek64(fd, ((uint64_t)line) * line_len + 40, SEEK_SET);
    int ret;
    read(fd, &ret, sizeof(int));                 /* the identifier following the hash */
    return ret;
}
then just do your binary search as usual. It might be a bit slower, but it won't start chewing up your virtual memory.
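For illustration, a binary search loop that could sit on top of those helpers; nlines (the number of records, e.g. st_size / line_len) is assumed to be known, and strncmp over the 40-character hash stands in for whatever comparison the real codes need:

#include <string.h>

/* Binary search over the whole file using the helpers above.
   Returns the identifier for `target` (a 40-char hash), or -1 if it is absent. */
int lookup(const char *target, int nlines) {
    char hash[40];
    int lo = 0, hi = nlines - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        read_hash(mid, hash);
        int c = strncmp(target, hash, 40);
        if (c == 0) return read_int(mid);
        if (c < 0)  hi = mid - 1;
        else        lo = mid + 1;
    }
    return -1;
}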
We don't know the back story, so it is hard to give you definitive advice. How much memory do you have? How sophisticated is your hard drive? Is this a learning project? Who's paying for your time? 32 GB of RAM doesn't seem so expensive compared to two days of work from a person who makes $50/h. How fast does this need to run? How far outside the box are you willing to go? Does your solution need to use advanced OS concepts? Are you married to a program in C? How about making Postgres handle this?
Here is a low-risk alternative. This option isn't as intellectually appealing as the other suggestions but has the potential to give you significant gains. Separate the file into 3 chunks of 8 GB or 6 chunks of 4 GB (depending on the machines you have around; it needs to fit comfortably in memory). On each machine run the same software, but entirely in memory, and put an RPC stub around each. Then write an RPC caller that asks each of your 3 or 6 workers for the integer associated with a given hash code.
Related
I have created a file browsing UI for an embedded device. On the embedded side, I am able to get all files in a directory off the hard disk and return stats such as name, size, modified, etc. This is done using opendir and closedir and a while loop that goes through every file until no files are left.
This is cool, until file counts reach large quantities. I need to implement pagination and sorting. Suppose I have 10,000 files in a directory - how can I possibly go through this amount of files and sort based on size, name, etc, without easily busting the RAM (about 1mb of RAM... !). Perhaps something already exists within the Hard Drive OS or drivers?
Here are two suggestions, both of which have a small memory footprint. The first uses no more memory than the number of results you wish to return for the request - O(r) memory, depending only on the size of the result set - but it is ultimately quadratic in time (or worse) if the user really does page through all results:
You are only looking for a small page of results (e.g. r=25 entries). You can generate these by scanning through all filenames and maintaining a sorted list of the items you will return: insert each file into a list of length r, keeping only the best r results seen so far. (In practice you would not even insert a file F if it sorts after the current rth entry.)
How would you generate the 2nd page of results? You already know the 25th file from the previous request - so during the scan ignore all entries that are before that. (You'll need to work harder if sorting on fields with duplicates)
The upside is the minimum memory required - the memory needed is not much larger than the r results you wish to return (and can even be less if you don't cache the names). The downside is generating the complete result will be quadratic in time in terms of the number of total files you have. In practice people don't sort results then page through all pages, so this may be acceptable.
If your memory budget is larger (say, enough to hold one file position per entry) but you still don't have enough space to perform a simple in-memory sort of all 10,000 filenames, then seekdir/telldir are your friends. That is, create an array of longs by streaming readdir and using telldir to capture the position of each entry (you might even be able to compress the delta between successive telldir values into a 2-byte short). As a minimal implementation you can then sort them all with the C library's qsort, writing your own comparison callback to turn a position into a comparable value. Your callback will use seekdir twice to read the two filenames.
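A rough sketch of that idea, with illustrative names and a fixed-size name buffer as assumptions (a real version would add error handling and possibly the delta compression mentioned above):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static DIR *g_dir;   /* global so the qsort comparator can reach it */

/* Read the filename stored at a telldir() position. */
static void name_at(long pos, char *buf, size_t bufsz) {
    seekdir(g_dir, pos);
    struct dirent *de = readdir(g_dir);
    snprintf(buf, bufsz, "%s", de ? de->d_name : "");
}

/* Compare two directory positions by the names they point at. */
static int cmp_by_name(const void *a, const void *b) {
    char na[256], nb[256];
    name_at(*(const long *)a, na, sizeof na);
    name_at(*(const long *)b, nb, sizeof nb);
    return strcmp(na, nb);
}

int main(void) {
    g_dir = opendir(".");
    if (!g_dir) return 1;

    long *pos = NULL;
    size_t n = 0, cap = 0;

    /* Capture the position of every entry instead of its name. */
    for (;;) {
        long p = telldir(g_dir);
        if (!readdir(g_dir)) break;
        if (n == cap) { cap = cap ? cap * 2 : 256; pos = realloc(pos, cap * sizeof *pos); }
        pos[n++] = p;
    }

    qsort(pos, n, sizeof *pos, cmp_by_name);   /* each compare re-reads two names via seekdir */

    for (size_t i = 0; i < n; i++) {
        char name[256];
        name_at(pos[i], name, sizeof name);
        puts(name);                            /* emit in sorted order; paginate as needed */
    }

    closedir(g_dir);
    free(pos);
    return 0;
}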
The above approach is overkill - you just sorted all entries when you only needed one page of ~25 - so for fun, why not read up on Hoare's QuickSelect algorithm and use a version of it to identify the results within the required range? You can recursively ignore all entries outside the required range and only sort the entries between the first and last entry of the results.
What you want is an external sort, that's a sort done with external resources, usually on disk. The Unix sort command does this. Typically this is done with an external merge sort.
The algorithm is basically this. Let's assume you want to dedicate 100k of memory to this (the more you dedicate, the fewer disk operations, the faster it will go).
Read 100k of data into memory (ie. call readdir a bunch).
Sort that 100k hunk in-memory.
Write the hunk of sorted data to its own file on disk.
You can also use offsets in a single file.
GOTO 1 until all hunks are sorted.
Now you have X hunks of 100k on disk, and each of them is sorted. Let's say you have 9 hunks. To keep within the 100k memory limit, we'll divide the work up into the number of hunks + 1. 9 hunks, plus 1, is 10. 100k / 10 is 10k. So now we're working in blocks of 10k.
Read the first 10k of each hunk into memory.
Allocate another 10k (or more) as a buffer.
Do a K-way merge on the hunks (sketched in code after this list).
Write the smallest in any hunk to the buffer. Repeat.
When the buffer fills, append it to a file on disk.
When a hunk empties, read the next 10k from that hunk.
When all hunks are empty, read the resulting sorted file.
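A minimal sketch of the K-way merge step, under some assumptions: each sorted hunk is a text file with one name per line, names fit in 256 bytes, and stdio's own buffering stands in for the explicit 10k blocks described above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Usage: ./merge hunk1 hunk2 ... > sorted.txt */
int main(int argc, char **argv) {
    int k = argc - 1;
    FILE **in = malloc(k * sizeof *in);
    char (*head)[256] = malloc(k * 256);   /* current smallest unwritten line of each hunk */
    int *live = malloc(k * sizeof *live);
    if (!in || !head || !live) return 1;

    for (int i = 0; i < k; i++) {
        in[i] = fopen(argv[i + 1], "r");
        live[i] = in[i] && fgets(head[i], 256, in[i]) != NULL;
    }

    for (;;) {
        int min = -1;
        for (int i = 0; i < k; i++)        /* pick the smallest head among the live hunks */
            if (live[i] && (min < 0 || strcmp(head[i], head[min]) < 0))
                min = i;
        if (min < 0) break;                /* every hunk is exhausted */
        fputs(head[min], stdout);
        live[min] = fgets(head[min], 256, in[min]) != NULL;   /* advance that hunk */
    }
    return 0;
}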
You might be able to find a library to perform this for you.
Since this is obviously overkill for the normal case of small lists of files, only use it if there are more files in the directory than you care to have in memory.
Both in-memory sort and external sort begin with the same step: start calling readdir and writing to a fixed-sized array. If you run out of files before running out of space, just do an in-memory quicksort on what you've read. If you run out of space, this is now the first hunk of an external sort.
I am working on a project where I am using words, encoded by vectors, which are about 2000 floats long. Now when I use these with raw text I need to retrieve the vector for each word as it comes across and do some computations with it. Needless to say for a large vocabulary (~100k words) this has a large storage requirement (about 8 GB in a text file).
I initially had a system where I split the large text file into smaller ones and then for a particular word, I read its file, and retrieved its vector. This was too slow as you might imagine.
I next tried reading everything into RAM (takes about ~40GB RAM) figuring once everything was read in, it would be quite fast. However, it takes a long time to read in and a disadvantage is that I have to use only certain machines which have enough free RAM to do this. However, once the data is loaded, it is much faster than the other approach.
I was wondering how a database would compare with these approaches. Retrieval would be slower than the RAM approach, but there wouldn't be the memory overhead. Also, any other ideas would be welcome, and I have had some myself (i.e. caching, using a server that has everything loaded into RAM, etc.). I might benchmark a database, but I thought I would post here to see what others had to say.
Thanks!
UPDATE
I used Tyler's suggestion, although in my case I did not think a B-tree was necessary. I just hashed the words and their offsets. I could then look up a word and read in its vector at runtime. I cached the words as they occurred in the text, so each vector is read in at most once; this saves the overhead of reading in and storing unneeded words, making it superior to the RAM approach.
Just an FYI, I used Java's RandomAccessFile class and made use of its readLine(), getFilePointer(), and seek() methods.
Thanks to all who contributed to this thread.
UPDATE 2
For more performance improvement check out buffered RandomAccessFile from:
http://minddumped.blogspot.com/2009/01/buffered-javaiorandomaccessfile.html
Apparently the readLine from RandomAccessFile is very slow because it reads byte by byte. This gave me some nice improvement.
As a rule, anything custom coded should be much faster than a generic database, assuming you have coded it efficiently.
There are specific C libraries to solve this problem using B-trees. In the old days there was a famous library called Btrieve that was very popular because it was fast. In this application a B-tree will be faster and easier than fooling around with a database.
If you want optimal performance you would use a data structure called a suffix tree. There are libraries which are designed to create and use suffix trees. This will give you the fastest word lookup possible.
In either case there is no reason to store the entire dataset in memory, just store the B-tree (or suffix tree) with an offset to the data in memory. This will require about 3 to 5 megabytes of memory. When you query the tree you get an offset back. Then open the file, seek forwards to the offset and read the vector off disk.
You could use a simple text-based index file that just maps the words to indices, and another file containing the raw vector data for each word. Initially you read the index into a hashmap that maps each word to its index in the data file and keep it in memory. If you need the data for a word, you calculate the offset in the data file (index * 2000 * 4 bytes, for 32-bit floats) and read it as needed. You probably want to cache this data in RAM (if you are in Java, perhaps just use a weak map as a starting point).
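For instance, a minimal C sketch of the seek-and-read step, assuming the in-memory map has already turned the word into an index (the dimensions come from the question; everything else is illustrative):

#include <stdio.h>

#define DIM 2000   /* floats per word, as in the question */

/* Seek to the word's record in the raw data file and read its vector.
   Returns 0 on success, -1 on failure. */
int read_vector(FILE *f, long index, float *out) {
    long offset = index * (long)DIM * (long)sizeof(float);   /* 8000 bytes per record */
    if (fseek(f, offset, SEEK_SET) != 0) return -1;
    return fread(out, sizeof(float), DIM, f) == (size_t)DIM ? 0 : -1;
}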
This is basically implementing your own primitive database, but it may still be preferable because it avoids database setup / deployment complexity.
I'm working on creating a binary search algorithm in C that searches for a string in a .txt file. Each line is a string representing a stock ticker. Not being familiar with C, this is taking far too long. I have a few questions:
1.) Once I have opened a file using fopen, does it make more sense in terms of efficiency for the algorithm to step through the file using some function provided in the C library for scanning files, doing the compare directly from the file, or should I copy each line into an array and have the algorithm search the array?
2.) If I should compare directly from the file, what is the best way to step through it? Assume I have the number of lines in the file, is there some way to go directly to the middle line, scan the string and do the compare?
I'm sorry if this is too vague. Not too sure how to better explain. Thanks for your time
Unless your file is exceedingly big (> 2 GB), loading it into memory before searching it is the way to go. If you cannot load the file into memory, you could hold the offset of each line in an int[], or (if the file contains too many lines...) create another binary file and write the offset of each line to it as an integer...
Having everything in memory is by far preferable, though.
You cannot binary search lines of a text-file without knowing the length of each line in advance, so you'll most likely want to read each line into memory at first (unless the file is very big).
But if your goal is only to search for a single given line as quickly as possible, you might as well just do linear search directly on the file. There's no point in getting O(log n) at the cost of a O(n) setup cost if the search is only done once.
Reading it all in with a bulk read and walking through it with pointers (to memory) is very fast. Avoid doing multiple I/O calls if you can.
I should also mention that memory mapped files can be very suitable for something like this. See mmap() if on Unix. This is definitely your best bet for really large files.
This is a great question!
The challenge of binary search is that its benefits come from being able to skip past half the elements at each step in O(1). This guarantees that, since you only do O(lg n) probes, the runtime is O(lg n). This is why, for example, you can do a fast binary search on an array but not a linked list - in a linked list, finding the halfway point of the elements takes linear time, which dominates the time for the search.
When doing binary search on a file you are in a similar position. Since all the lines in the file might not have the same length, you can't easily jump to the nth line in the file given some number n. Consequently, implementing a good, fast binary search on a file will be a bit tricky. Somehow, you will need to know where each line starts and stops so that you can efficiently jump around in the file.
There are many ways you can do this. First, you could load all the strings from the file into an array, as you've suggested. This takes linear time, but once you have the array of strings in memory all future binary searches will be very fast. The catch is that if you have a very large file, this may take up a lot of memory and could be prohibitively expensive. Consequently, another alternative might be to store not the actual strings in the array, but rather the offsets into the file at which each string occurs. This would let you do the binary search quickly - you can seek to the proper offset in the file when doing a comparison - and for long strings it can be much more space-efficient than the above. And, if all the strings are roughly the same length, you could just pad every line to some fixed size to allow direct computation of the start position of each line.
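A small sketch of that offset-array variant, under some assumptions: the file is already sorted with one ticker per line, tickers fit in a small buffer, and the offsets array was filled in one linear pass by recording ftell() before each line is read:

#include <stdio.h>
#include <string.h>

#define MAX_LINE 16   /* tickers are short; adjust to the real maximum */

/* Read the line that starts at `offset` into buf, stripping the newline. */
static void line_at(FILE *f, long offset, char *buf) {
    fseek(f, offset, SEEK_SET);
    if (!fgets(buf, MAX_LINE, f)) buf[0] = '\0';
    buf[strcspn(buf, "\n")] = '\0';
}

/* Binary search the recorded line-start offsets; returns the matching offset or -1. */
long search(FILE *f, const long *offsets, long nlines, const char *target) {
    char buf[MAX_LINE];
    long lo = 0, hi = nlines - 1;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        line_at(f, offsets[mid], buf);
        int c = strcmp(target, buf);
        if (c == 0) return offsets[mid];
        if (c < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}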
If you're willing to expend some time implementing more complex solutions, you might want to consider preprocessing the file so that instead of having one string per line, instead you have at the top of the file a list of fixed-width integers containing the offsets of each string in the file. This essentially does the above work, but then stores the result back in the file to make future binary searches much faster. I have some experience with this sort of file structure, and it can be quite fast.
If you're REALLY up for a challenge, you could alternatively store the strings in the file using a B-tree, which would give you incredibly fast lookup times for each string by minimizing the number of disk reads that you need to do.
Hope this helps!
I don't see how you can do the compare directly from the file. You will have to have a buffer to store data read from disk and use that buffer, so it doesn't make sense - it is just impossible.
You cannot jump to a particular line in the file. Not unless you know the offset in bytes of the beginning of that line relative to the beginning of the file.
I'd recommend using mmap to map this file directly into memory and working with it as with a character array. The operating system will make working with the file (seeking, reading, writing) transparent to you, and you will just work with it like a buffer in memory. Note that on 32-bit systems mmap is limited by the available address space (well under 4 GB in practice). But if the file is bigger than that, you probably need to ask the question - why on earth does someone have this big a file not in an indexed database?
At the moment I am trying to write an unreal amount of data out to files.
Basically I generate a new struct of data and write it out to a file until the file becomes 1 GB big, and this happens for 6 files of 1 GB each. The structs are small: 8 bytes long, with two variables, id and amount.
When I generate my data, the structs are created and written to file in order of amount.
But I need the data to be sorted by id.
Remember there are 6 GB of data - how could I sort these structs by their id value and then write them to file?
Or should I write to file first, and then sort each individual file, and how would I bring all this data together into one file?
I am kind of stuck, because I would like to hold it in an array, but obviously this amount of data is too big.
I need a good way to sort a lot of data (6 GB).
I haven't found a question with a really basic answer on this, so here goes.
If you're on a 64 bit machine, by the way, you should seriously consider writing all the data into a file, memory mapping the file, and just use whatever array sort you like. Quicksort is pretty cache-friendly: it won't thrash badly. The assignment is probably designed to stop you doing this, but might be a bit out of date ;-)
Failing that, you need some kind of external sort. There are other ways to do it, but I think merge sort is probably the simplest. Before you start merging:
work out how much data you can fit into memory (or, again, mmap it). If you're on a PC then 1GB seems like a fair assumption, but it may be a few times more or less.
load this much data (so one of your 6 files, in the example)
quicksort it (since you tagged "quicksort", I guess you know how to do that), or any other sort of your choice.
write it back to disk (if you didn't mmap).
This leaves you with 6 1GB files, each of which individually is sorted. At this point you can either work up gradually, or go for the whole lot in one go. With 6 chunks, going for the whole lot is fine, in what is called a "6-way merge":
open a file for writing
open your 6 files for reading, and read a few million records out of each
examine the 6 records at the start of each of the 6 buffers. One of these 6 must be the smallest of all. Write it to the output, and move forward one step through that buffer.
as you reach the end of each buffer, refill it from the correct file.
There's some optimization you can do regarding how you work out which of your 6 possibilities is the smallest, but the big performance difference will be to make sure you use large enough read and write buffers.
Obviously there's nothing special about the merge being 6-way. If you'd rather stick to a 2-way merge, which is easier to code, then of course you can. It will take 5 2-way merges to merge 6 files.
I would recommend this tool: it is a lightweight database that runs in memory and takes up very little memory. It will hold your information and you can query it to retrieve your information.
http://www.sqlite.org/features.html
I suggest you don't.
If you are going to hold such an amount of data, why not use a dedicated database format that can have lots of different indexes and a powerful query engine?
But if you still want to use your old-fashioned fixed-endian struct, then I would suggest breaking your data into smaller files, sorting each one, and merging them. A good merge of q sorted runs over n records takes O(n log q). Also be sure to pick the right algorithm for your files.
The easiest way (in development time) to do this is to write the data out to separate files according to their ID. You don't have to have a 1-to-1 match between the number of files and the number of IDs (in case there are a lot of IDs); if you choose a prefix of the ID (so a record with key 987 might go in the '9' file while a record with key 456 goes in the '4' file), you won't have to worry about locating keys across all of the files: sorting each file by itself and then reading the files in order (by their names) gives you sorted results.
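A hedged sketch of that bucketing pass, assuming the 8-byte struct from the question; it partitions by id range rather than by literal decimal prefix, so that sorting each bucket and concatenating them in order is guaranteed to give a globally sorted result (file names and the bucket count are illustrative):

#include <stdio.h>

#define NBUCKETS 16
#define BUCKET_WIDTH (1u << 28)            /* split 32-bit ids into 16 contiguous ranges */

struct rec { unsigned id; int amount; };   /* the 8-byte record from the question */

int main(void) {
    FILE *bucket[NBUCKETS];
    for (int i = 0; i < NBUCKETS; i++) {
        char name[32];
        snprintf(name, sizeof name, "bucket_%02d.bin", i);
        bucket[i] = fopen(name, "wb");
        if (!bucket[i]) return 1;
    }

    FILE *in = fopen("data.bin", "rb");    /* or loop over the six 1 GB files */
    struct rec r;
    while (in && fread(&r, sizeof r, 1, in) == 1)
        fwrite(&r, sizeof r, 1, bucket[r.id / BUCKET_WIDTH]);

    /* Each bucket now holds a disjoint id range; sort each bucket in memory
       (roughly 6 GB / 16 each if ids are spread evenly) and concatenate them
       in bucket order to get a fully sorted file. */
    return 0;
}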
If that is not possible or easy, then you need to do an external sort of some type. Since the data is spread across several files this is a bit of a pain. The easiest thing (by development time) is to first sort each individual file independently and then merge them together into a new set of files sorted by ID. Look up merge sort if you don't know what I'm talking about. At this step you are pretty much starting in the middle of merge sort.
As far as sorting the contents of a file which is too large to fit into RAM, you can either use merge sort directly on the file or use replacement selection sort to sort the file in place. This involves making several passes over the file while using some RAM (the more the better) to hold a priority queue (a binary heap) and a set of records that cannot possibly be of any use in this run (their keys suggest that they should be earlier in the file than the current run position, so you just hold on to them until the next run).
Searching for replacement selection sort or tournament sort will yield better explanations.
First, sort each file individually. Either load the whole thing into memory, or (better) mmap it, and use the qsort function.
Then, write your own merge sort that takes N FILE * inputs (i.e. N=6 in your case) and outputs to N new files, switching to the next one whenever one fills up.
Check out external sort. Find any of the external mergesort libraries out there and modify them to suit your need.
Well - since the actual assignment is to keep encoded data and later just compare it with decoded data, I would also say: use a database and just create a hash index on the ID column.
But regarding sorting such a huge number of records, another very important thing is to do it in parallel. There are many ways to do it. Steve Jessop mentioned a sort-merge approach; it is really easy to sort the first 6 chunks in parallel, and the only question is how many CPU cores and how much memory you have on your machine. (It is rare to find a computer with only 1 core today, and also not so rare to have 4 GB of memory.)
Maybe you could use mmap and use it as a huge array which you could sort with qsort. I'm not sure what the implications would be. Would it grow too much in memory?
For example, let's say I want to find a particular word or number in a file. The contents are in sorted order (obviously). Since I want to run a binary search on the file, it seems like a real waste of time to copy the entire file into an array and then run binary search... I've effectively made it a linear-time algorithm, because I'll have to spend O(n) time copying the darn file before I can run my search.
Is there a faster way to do this? Is there maybe something like lseek which works with lines instead of bytes?
If there isn't, am I better off just doing a linear search instead (assuming I'm only running the search once for the entire duration of my program) ?
You cannot seek by line. It's pretty obvious once you think about it.
But you can do a sort-of binary search on a text file (a code sketch follows the steps below).
What you do is:
Stat the file to get the length or seek to the end and get the position.
Memory map the file.
(This is best, I think, but you can use lseek and read if you must.)
Seek to the middle of the file, minus your average line length. Just guess.
Scan forward for a newline, unless you are at position 0.
Read your line and compare.
Repeat for 1/4th or 3/4ths, 1/8th, 1/16th, etc.
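A minimal sketch of that scheme, under a few assumptions: data and size come from fstat plus mmap (the first two steps, not shown), the file is sorted with exactly one key per line, and it ends with a newline:

#include <stddef.h>
#include <string.h>

/* Compare a NUL-terminated target against a newline-terminated line. */
static int cmp_key(const char *target, const char *line) {
    size_t i = 0;
    while (target[i] && line[i] != '\n' && target[i] == line[i]) i++;
    char lc = (line[i] == '\n') ? '\0' : line[i];
    return (unsigned char)target[i] - (unsigned char)lc;
}

/* Returns a pointer to the matching line in the mapped file, or NULL. */
const char *search(const char *data, size_t size, const char *target) {
    size_t lo = 0, hi = size;                       /* byte range that can still hold the line */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        /* find the first line start at or after mid (position 0 is a line start) */
        const char *start;
        if (mid == 0) {
            start = data;
        } else {
            const char *nl = memchr(data + mid - 1, '\n', size - (mid - 1));
            start = nl ? nl + 1 : NULL;
        }
        if (!start || (size_t)(start - data) >= hi) { hi = mid; continue; }

        int c = cmp_key(target, start);
        if (c == 0) return start;
        if (c < 0)  hi = (size_t)(start - data);       /* everything from here on is too big */
        else        lo = (size_t)(start - data) + 1;   /* everything up to here is too small */
    }
    return NULL;
}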
A disk-based binary search needs to be, at least initially, "block-aware", i.e. aware of the fact that whether you read a single byte or a whole bunch, the I/O cost is the same. The other thing it needs to be aware of is the relatively higher cost of a seek operation compared to a sequential read.
Several of the ways that it can use this awareness about the characteristics of disk I/O:
Towards the end of the search, favor linear searching (scanning) rather than seeking.
In the beginning check both the first and last element in the block, this may help extrapolate a better guess for the next split
Cache a tree (or even a short flat list) of some of the items found in various places in the file (a bit like the intermediate nodes in a formal B-tree structure)
Declare and use an appropriate buffer size
If the file is small, like under a few hundred kilobytes, it's almost certainly faster to read (or virtually memory map) the entire file into memory. This is because the overhead of doing several i/o operations to seek and transfer is much worse than just reading the whole file, which is what most programs do and most operating systems assume is done.
Unless all the lines are the same length, or have a very predictable length, there's no easy way to seek to line #n. But to perform a binary search, I'd work with byte offsets and read, say, 100 bytes (if the words are all less than 100 characters long) before and after the offset - a total of 200 bytes. Then scan for the newlines before and after the middle of it to extract the word.
Yes, you can lseek, but it would help if the size of each word/number per line were fixed. If that is not the case, which is more likely, then you have to lseek by a fraction of the file size and seek to the nearest word beginning to still achieve close to the typical O(log n) time complexity of binary search.
There wouldn't be an "lseek" function for lines, because the file commands do not have the concept of a "line". That concept exists at a different layer of abstraction than the raw file commands.
As to whether it's faster or not, the answer will depend upon a number of factors, including the size of the file, the disk drive speed, and the amount of RAM available. If it isn't a large file, my guess is it would be faster to load the entire file into memory.
If it is a large file, I would use the binary search algorithm to narrow it down to a smaller range (say, a couple of megabytes), then load up that entire block.
As mentioned above, since the file is a text file, predicting the byte at which a given line begins within the file can't be done reliably. The ersatz binary search idea is a pretty good one. But it really won't save you a ton unless the file is huge, given how fast sequential I/O is nowadays and how slow random I/O is.
As you mention, if you are going to read it in, you might as well linearly search it as you go. So do so, use a modified Boyer-Moore search as you read it in and you'll do pretty well.
There are so many performance tradeoffs here that it's impossible to know what makes sense until you have measurements on typical data.
If you're going to maintain this code, it needs to be simple. If searches are rare or the file is small, go with linear search. If the cost actually matters, you'll have to do some experiments.
The second thing I would try after linear search would be to mmap the file and scan through it for newlines. This does take linear time, but strchr can be very fast. It helps if you can guarantee the file ends in a newline. Once you have the lines demarcated, you can keep the number of comparisons small by doing a binary search.
Another option you should consider is Boyer-Moore string search. This is a sub-linear time search and depending on the size of the search pattern, it may be faster than the logarithmic binary search. Boyer-Moore is especially good with long search strings.
Finally, if you determine binary search is really good, but that identifying the lines is a performance bottleneck, you could precompute the start location of each line and store these precomputed locations in binary format in an auxiliary file.
I feel comfortable making only one prediction: it is almost certainly worth avoiding reading in one line at a time with something like readline() or fgets(), because this strategy invariably involves calling malloc() to hold the contents of the line. The cost of calling malloc() on every line is likely to swamp any cost of search or comparison.