There is a quite large file (>10 GB) on disk; each line inside the file consists of a line number and a person's name, like this:
1 Jane
2 Perk
3 Sime
4 Perk
.. ..
I have to read this large file, find the frequency of each name, and finally output the results in descending order of frequency, like this:
Perk 2
Jane 1
Sime 1
The interviewer asked that the job be done as efficiently as possible; multithreading is allowed. My solution is something like this:
1. Because the file is too large, I partition it into several small chunks of about 100 MB each; via lseek I can locate the beginning and the end of each chunk (beg, end);
2. For these chunks, there is a shared hash map keyed on the person's name, with the number of times that name has been seen so far as the value;
3. A separate thread goes through each chunk; every time a thread encounters a person's name, it increments the corresponding value in the shared hash map;
4. When all threads finish, I think it's time to sort the hash map according to the value field.
But because there might be a huge number of distinct names in that file, the sorting would be slow. I haven't come up with a good idea for how to output the names in descending order.
I hope someone can help me with the above problem and suggest a better approach to the multithreading and the sorting.
Using a map-reduce approach could be a good idea for your problem. That approach would consist of two steps:
Map: read chunks of data from the file and create a thread to process that data
Reduce: the main thread waits for all other threads to finish and then it combines the results from each individual thread.
The advantage of this solution is that you would not need locking between the threads, since each one of them would operate on a different chunk of data. Using a shared data structure, as you are proposing, could be a solution too, but you may have some overhead due to contention for locking.
You need to do the sorting part at the reduce step, when the data from all the threads is available. But you might want to do some work during the map step, so that it is easier (quicker) to finish the complete sort at the reduce step.
If you prefer to avoid the sequential sorting at the end, you could use a custom data structure. I would use a map (something like a red-black tree or a hash table) for quickly finding a name, plus a heap to maintain the order of frequencies among names. Of course, you would need parallel versions of those data structures. Depending on how coarse the parallelization is, you may or may not run into lock-contention problems.
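To make the map/reduce idea concrete, here is a minimal sketch (in C++, with illustrative names; the file reading and chunking are omitted) where each thread tallies its own chunk into a private map and the main thread merges and sorts the partial results:

#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using Counts = std::unordered_map<std::string, long long>;

// "Map" step: one thread counts the names in its own chunk, no locking needed.
Counts count_chunk(const std::vector<std::string>& names) {
    Counts local;
    for (const auto& name : names) ++local[name];
    return local;
}

// "Reduce" step: the main thread merges the partial maps and sorts by frequency.
std::vector<std::pair<std::string, long long>> reduce(const std::vector<Counts>& partials) {
    Counts total;
    for (const auto& part : partials)
        for (const auto& entry : part) total[entry.first] += entry.second;
    std::vector<std::pair<std::string, long long>> out(total.begin(), total.end());
    std::sort(out.begin(), out.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return out;
}

int main() {
    // Pretend these two chunks were read from the big file.
    std::vector<std::vector<std::string>> chunks = {{"Jane", "Perk"}, {"Sime", "Perk"}};
    std::vector<Counts> partials(chunks.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks.size(); ++i)
        workers.emplace_back([&partials, &chunks, i] { partials[i] = count_chunk(chunks[i]); });
    for (auto& t : workers) t.join();
    auto sorted = reduce(partials);   // yields Perk 2, Jane 1, Sime 1
}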
If I asked that as an interview question using the word "efficiently", I would expect an answer something like "cut -f 2 -d ' ' < file | sort | uniq -c | sort -rn", because efficiency is most often about not wasting time solving an already-solved problem. Actually, this is a good idea; I'll add something like this to our interview questions.
Your bottleneck will be the disk, so any kind of multithreading is over-designing the solution (which would also count against "efficiency"). Splitting your reads like this will either make things slower on rotating disks, or at least confuse the buffer cache and make it less likely to kick in a drop-behind algorithm. Bad idea; don't do it.
I don't think multithreading is a good idea. The "slow" part of the program is reading from disk, and multithreading the disk read won't make it faster. It will only make the program much more complex (for each chunk you have to find the first complete line, for example, you have to coordinate the various threads, and you have to lock the shared hash map on every access). You could instead work with "local" hash maps and merge them at the end (when all the threads finish, at the end of the 10 GB, the partial hash maps are merged). Then you don't need to synchronize access to a shared map.
I think that sorting the resulting hash map will be the easiest part, if the full hash map can be kept in memory :-) You simply copy it into a malloc(ed) block of memory and qsort it by its counter.
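As a rough illustration of that last step (C++ shown, using std::qsort; the names here are assumptions, and in plain C the flat buffer would simply be a malloc'ed array of structs):

#include <cstdlib>
#include <string>
#include <unordered_map>
#include <vector>

// One flat entry per name; the name pointers stay owned by the hash map.
struct Entry { const char* name; long long count; };

static int by_count_desc(const void* a, const void* b) {
    const Entry* x = static_cast<const Entry*>(a);
    const Entry* y = static_cast<const Entry*>(b);
    return (y->count > x->count) - (y->count < x->count);   // descending by counter
}

std::vector<Entry> sorted_by_count(const std::unordered_map<std::string, long long>& freq) {
    std::vector<Entry> flat;
    flat.reserve(freq.size());
    for (const auto& kv : freq) flat.push_back({kv.first.c_str(), kv.second});
    std::qsort(flat.data(), flat.size(), sizeof(Entry), by_count_desc);
    return flat;
}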
Steps (2) and (4) of your solution make it essentially sequential (the second introduces locking to keep the hash map consistent, and the last one attempts to sort all the data in one go).
One-shot sorting of the hash map at the end is a little strange; you should use an incremental sorting technique, like heapsort (locking of the data structure required) or mergesort (sort parts of the "histogram" file, but avoid merging everything "in one main thread at the end": try to build a sorting network and mix the contents of the output file at each step of the sorting).
Multi-threaded reads might be an issue, but with modern SSD drives and aggressive read caching, multithreading is not the main slowdown factor; it is all about synchronizing the result-sorting process.
Here's a sample of mergesort's parallelization: http://dzmitryhuba.blogspot.com/2010/10/parallel-merge-sort.html
Once again, as I have said, a sorting network might enable an efficient parallel sort, unlike the straightforward "wait for all sub-threads and sort their results" approach. Maybe a bitonic sort, if you have a lot of processors.
The interviewer's original question states "...and multithreading is allowed". The phrasing of this question might be a little ambiguous; however, the spirit of the question is obvious: the interviewer is asking the candidate to write a program to solve the problem and to analyse/justify the use (or not) of multithreading within the proposed solution. It is a simple question to test the candidate's ability to think around a large-scale problem and explain the algorithmic choices they make, making sure the candidate hasn't just regurgitated something from an internet website without understanding it.
Given this, this particular interview question can be solved in O(n log n) (asymptotically speaking) whether multithreading is used or not, and multithreading can additionally be used to reduce the actual execution time, by roughly a factor of n / log n when the work is split into n partitions (as discussed below).
Solution Overview
If you were asked the OP's question by a top-flight company, the following approach would show that you really understood the problem and the issues involved. Here we propose a two-stage approach:
The file is first partitioned and read into memory.
A special version of Merge Sort is used on the partitions that simultaneously tallies the frequency of each name as the file is being sorted.
As an example, let us consider a file with 32 names, each one letter long, and each with an initial frequency count of one. The above strategy can be visualised as follows:
1. File: ARBIKJLOSNUITDBSCPBNJDTLGMGHQMRH 32 Names
2. A|R|B|I|K|J|L|O|S|N|U|I|T|D|B|S|C|P|B|N|J|D|T|L|G|M|G|H|Q|M|R|H 32 Partitions
1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1 with counts
3. AR BI JK LO NS IU DT BS CP BN DJ LT GM GH MQ HR Merge #1
11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 and tally
4. ABIR JKLO INSU BDST BCNP DJLT GHM HMQR Merge #2
1111 1111 1111 1111 1111 1111 211 1111 and tally
5. ABIJKLOR BDINSTU BCDJLNPT GHMQR Merge #3
11111111 1111211 11111111 22211 and tally
6. ABDIJKLNORSTU BCDGHJLMNPQRT Merge #4
1212111111211 1112211211111 and tally
7. ABCDGHIJKLMNOPQRSTU Merge #5
1312222212221112221 and tally
So, if we read the final list in memory from start to finish, it yields the sorted list:
A|B|C|D|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U
-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
1|3|1|2|2|2|2|2|1|2|2|2|1|1|1|2|2|2|1 = 32 Name instances (== original file).
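For concreteness, a minimal sketch of one such merge-and-tally step might look like this (C++ with illustrative types; each run is a list of (name, count) pairs already sorted by name):

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Run = std::vector<std::pair<std::string, long long>>;

Run merge_and_tally(const Run& a, const Run& b) {
    Run out;
    std::size_t i = 0, j = 0;
    while (i < a.size() || j < b.size()) {
        // Take the lexicographically smaller next name (or whatever run remains).
        bool take_a = (j == b.size()) || (i < a.size() && a[i].first <= b[j].first);
        std::pair<std::string, long long> next = take_a ? a[i++] : b[j++];
        if (!out.empty() && out.back().first == next.first)
            out.back().second += next.second;   // same name: accumulate the tally
        else
            out.push_back(std::move(next));
    }
    return out;
}

Applying this step repeatedly over pairs of runs reproduces the Merge #1 through Merge #5 rows shown above.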
Why the Solution is Efficient
Whether a hash table is used (as the original poster suggested) and whether multithreading is used or not, this problem cannot be solved more efficiently than O(n log n), because a sort must be performed. Given this restriction, there are two strategies that can be employed:
Read data from disk, use hash table to manage name/frequency totals, then sort the hash table contents (original poster's suggested method)
Read data from disk, initialise each name with a frequency count of one, then merge sort the names while simultaneously summing the totals for each name (this solution).
Solution (1) requires the hash table to be sorted after all the data has been read in. Solution (2) performs its frequency tallying as it sorts, so the overhead of the hash table is removed. Without considering multithreading at all, we can already see that even with the most efficient hash table implementation for Solution (1), Solution (2) is already more efficient because it has no hash table overhead at all.
Constraints on Multithreading
Assuming the most efficient hash table implementation ever devised is used for Solution (1), both algorithms perform the same asymptotically, O(n log n); it's simply that the ordering of their operations is slightly different. However, while multithreading Solution (1) actually slows its execution down, multithreading Solution (2) gains substantial improvements in speed. How is this possible?
If we multithread Solution (1), either in the reading from disk or in the sort afterwards, we hit a problem of contention on the hash table as all threads try to access the hash table simultaneously. Especially for writing to the table, this contention could cripple the execution time of Solution (1) so much so that running it without multithreading would actually give a faster execution time.
For multithreading to give execution-time speed-ups, it is necessary to make sure that each block of work that each thread performs is independent of every other thread. This allows all threads to run at maximum speed with no contention on shared resources and to complete the job much faster. Solution (2) does exactly this: it removes the hash table altogether and employs Merge Sort, a divide-and-conquer algorithm that allows the problem to be broken into sub-problems that are independent of each other.
Multithreading and Partitioning to Further Improve Execution Times
In order to multithread the merge sort, the file can be divided into partitions and a new thread created to merge each consecutive pair of partitions. As names in the file are variable length, the file must be scanned serially from start to finish in order to be able to do the partitioning; random access on the file cannot be used. However, as any solution must scan the file contents at least once anyway, allowing only serial access to the file still yields an optimal solution.
What kind of speed-up in execution time can be expected from multithreading Solution (2)? The analysis of this algorithm is quite tricky given its simplicity, and has been the subject of various white papers. However, splitting the file into n partitions allows the program to execute about (n / log(n)) times faster than on a single CPU with no partitioning of the file. Simply put, if a single processor takes 1 hour to process a 640 GB file, then splitting the file into 64 10 GB chunks and executing on a machine with 32 CPUs allows the program to complete in around 6 minutes, a roughly 10-fold speed-up (ignoring disk overheads).
Related
I am in the process of developing a performance-critical network service in Rust. A request to my service looks like a vector ids: Vec<u64> of numerical ids. For each id in ids, my service must read the id-th record from a long sequence of records stored contiguously on an SSD. Because all records have the same size RECORD_SIZE (in practice, around 6 KB), the position of every record is entirely predictable, so a trivial solution reduces to
for id in ids {
    file.seek(SeekFrom::Start(id * RECORD_SIZE)).unwrap();
    let mut record = vec![0u8; RECORD_SIZE];
    file.read_exact(&mut record).unwrap();
    records.push(record);
}
// Do something with `records`
Now, sadly, the following apply:
The elements of ids are non-contiguous, unpredictable, and unstructured; they are effectively distributed uniformly at random in the range [0, N].
N is way too large for me to store the entire file in memory.
ids.len() is much smaller than N, so I cannot efficiently cycle through the file linearly without having 99% of my reads be for records that have nothing to do with ids.
Now, reading the specs, the raw QD32 IOPS of my SSD should allow me to collect all the records in time (i.e., before the next request comes). But what I observe with my trivial implementation is much, much worse. I suspect that this is because it is effectively a QD1 (queue depth 1) implementation:
Read something from disk at a random location.
Wait for the data to arrive, store it in RAM.
Read the next thing from disk at another, independent location.
Now, the thing is I know all ids at the very beginning, and I would love it if there was a way to specify:
As much in parallel as possible, read all the locations relevant to each element of ids.
When that is done, carry on doing something on everything.
I am wondering if there is an easy way to get this done in Rust. I scouted for file.parallel_read-like functions in the standard library and for useful crates on crates.io, but to no avail. This puzzles me, because it should be a relatively common problem in a server/database setting. Am I missing something?
Depending on the architecture you're targeting, there is the posix_fadvise syscall:
Programs can use posix_fadvise() to announce an intention to access file data in a specific pattern in the future, thus allowing the kernel to perform appropriate optimizations.
You would pass the offset, RECORD_SIZE, and probably the POSIX_FADV_WILLNEED advice. Both the function and the constant are available in the libc crate. The same idea works with memory-mapped files using posix_madvise() and POSIX_MADV_WILLNEED, as hinted in the comments.
You will then need to do some performance tuning to determine how far ahead to make these calls: not early enough and the data isn't there when you want it; too early and you needlessly add pressure on your system memory.
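For illustration, here is a minimal sketch of the underlying POSIX calls (shown in C++ with error handling omitted; in Rust the same posix_fadvise function and POSIX_FADV_WILLNEED constant are exposed by the libc crate, and the raw file descriptor can be obtained via AsRawFd):

#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <utility>
#include <vector>

constexpr off_t RECORD_SIZE = 6 * 1024;   // assumed record size

std::vector<std::vector<uint8_t>> read_records(int fd, const std::vector<uint64_t>& ids) {
    // 1. Announce every upcoming random read so the kernel can start fetching pages.
    for (uint64_t id : ids)
        posix_fadvise(fd, static_cast<off_t>(id) * RECORD_SIZE, RECORD_SIZE, POSIX_FADV_WILLNEED);

    // 2. Read the records; many should now be served from the page cache.
    std::vector<std::vector<uint8_t>> records;
    records.reserve(ids.size());
    for (uint64_t id : ids) {
        std::vector<uint8_t> buf(RECORD_SIZE);
        pread(fd, buf.data(), RECORD_SIZE, static_cast<off_t>(id) * RECORD_SIZE);
        records.push_back(std::move(buf));
    }
    return records;
}

Whether to issue the hints all up front or in a sliding window a little ahead of the reads is exactly the tuning knob described above.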
I was recently posed a simple question to which I responded with an unusual answer. I suspect my answer was particularly bad but am not certain what the performance characteristics truly would be.
Suppose you are given inputs each in the form of a hash code (just a bunch of bits.) Uniquely corresponding to each hash code is an integer value which you would like to return. Your system knows the most likely queries and caches them in memory. For the remaining, less frequent lookups you will have to access a hard drive (disk I/O.) There exists a least recently used policy to replace the cache in memory but that shouldn't be terribly important here.
For the hashes on disk, the conventional way to store them would be in a Database (keyed on the hashes) in a tree shape. This would grant you O(log(n)) lookup time once at the database stage.
My answer seemed odd to the asker and a little odd to me. Suppose instead of a database, you simply kept the values on disk in a file system with a directory structure that exactly mirrored the bits of the hash values. For instance, if we had three-bit hashes (and only had entries for 100 => 42 and 010 => 314159), your file system would look like:
\0
.\00
.\000
.\100
42.justanumber
.\10
.\010
314159.justanumber
.\110
\1
.\01
.\001
.\101
.\11
.\011
.\111
The x.justanumber files are empty. The filenames themselves contain the information you're looking for.
Further assume that updates never occur (the entire DB/file system is re-written weekly.) I'd think that a filesystem set up this way would give you O(1) lookup time instead of the O(log(n)) lookup time of a tree-based DB. Am I missing something? Why would this not be preferable?
I believe I've come to an answer.
If the system always uses a folder for every possible bit combination and you have a 32-bit hash, you will have 2^32 folders at the outermost layer and roughly the same number across all the inner layers, so you'll have about 2^33 folders, each of size 4 kB due to page-size limitations. Additionally, each "empty" hash file will occupy 4 kB on disk for the same reason. So, assuming, say, 5% occupancy, you would have:
2^33 * 4kB + 2^32 * 4kB * 0.05 = 35.22 TB of storage needed.
The savings are O(1) lookup time instead of O(log(# hashes)) lookup time, which almost certainly doesn't justify the storage space requirements. (Also remember this is for the uncommon case where you have a memory/cache miss.)
Admittedly, if you only created the minimum number of folders needed to support the 5% occupancy, you'd end up needing less space. The exact number of folders needed would depend on the distribution of the hashes. Assuming a good hash function, this should be close to random, which I believe means we'll end up needing the majority of the inner layers and definitely 5% of the outermost layer of directories. This instead gives something like:
2^32 * 4 kB * 0.05 + 2^32 * 4kB * 0.40 + 2^32 * 4kB * 0.05 = 8.59 TB
...which is still way too much space for such a simple thing. (I made up the 40% figure for the inner folders. If anyone can come up with a rigorous figure there, please comment/answer.)
CUDA is awesome and I'm using it like crazy, but I'm not using its full potential because I'm having an issue transferring memory, and I was wondering if there is a better way to get a variable amount of data out. Basically I send a 65,535-item array into CUDA, and CUDA analyzes each data item in around 20,000 different ways; if there's a match according to my program's logic, it saves a 30-int list as a result. Think of my logic as analyzing each different combination, looking at the total, and, if the total equals a number I'm looking for, saving the results (which is a 30-int list for each analyzed item).
The problem is that 65,535 (blocks/items in the data array) * 20,000 (total combinations tested per item) = 1,310,700,000. This means I need to create an array of that size to cover the chance that all the data is a positive match (which is extremely unlikely, and creating int output[1310700000][30] seems crazy for memory). I've been forced to make it smaller and send fewer blocks to process, because I don't know whether CUDA can write efficiently to a linked list or a dynamically sized list (with this approach it writes the output to host memory using block * number_of_different_way_tests).
Is there a better way to do this? Can CUDA somehow write to free memory that is not derived from the block id? When I test this process on the CPU, less than 10% of the items have a positive match, so it's extremely unlikely I'll use that much memory each time I send work to the kernel.
P.S. Looking at the above: although it's exactly what I'm doing, if it's confusing, then another way to think about it (not exactly what I'm doing, but close enough to understand the problem) is that I am sending 20,000 arrays (each containing 65,535 items), adding each item to its peer in the other arrays, and if the total equals a number (say 200-210) then I want to know the numbers that were added to produce that matching result.
If the numbers span a very wide range then not all of them will match, but with my approach I'm forced to malloc that huge amount of memory. Can I capture the results while mallocing less memory? My current approach is to malloc as much as I have free, but then I'm forced to run fewer blocks, which isn't efficient (I want to run as many blocks and threads at a time as possible, because I like the way CUDA organizes and runs blocks). Are there any CUDA or C tricks I can use for this, or am I stuck with mallocing the maximum possible number of results (and buying a lot more memory)?
As per Roger Dahl's great answer:
The functionality you're looking for is called stream compaction.
You probably do need to provide an array that contains room for 4 solutions per thread because attempting to directly store the results in a compact form is likely to create so many dependencies between the threads that the performance gained in being able to copy less data back to the host is lost by a longer kernel execution time. The exception to this is if almost all of the threads find no solutions. In that case, you might be able to use an atomic operation to maintain an index into an array. So, for each solution that is found, you would store it in an array at an index and then use an atomic operation to increase the index. I think it would be safe to use atomicAdd() for this. Before storing a result, the thread would use atomicAdd() to increase the index by one. atomicAdd() returns the old value, and the thread can store the result using the old value as the index.
However, given a more common situation, where there's a fair number of results, the best solution will be to perform a compacting operation as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.
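As a rough sketch of the atomic-index idea (shown here in plain C++ with std::atomic, and with made-up names; inside a CUDA kernel the equivalent is atomicAdd() on a counter in device global memory, which likewise returns the old value to use as the slot index):

#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical compact result store: a worker claims a 30-int slot only when it
// actually finds a match, instead of reserving one slot per test performed.
struct ResultStore {
    std::vector<int> data;               // room for max_results * 30 ints
    std::atomic<std::size_t> next{0};    // plays the role of the atomicAdd() counter

    explicit ResultStore(std::size_t max_results) : data(max_results * 30) {}

    // Returns a pointer to a private 30-int slot, or nullptr if the store is full.
    int* claim_slot() {
        std::size_t idx = next.fetch_add(1);   // old value becomes this worker's index
        if (idx >= data.size() / 30) return nullptr;
        return data.data() + idx * 30;
    }
};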
I am developing an IP filter and was wondering how I could, using any kind of data structure, build a VERY efficient and fast blacklist filter.
What I want to do is simple: for every incoming/outgoing connection I have to check against a list of blocked IPs.
The IPs are scattered, and memory use should scale linearly (not blow up with the size of the block list), because I want to use this on limited systems (homebrew routers).
I have time and could build anything from scratch; implementation difficulty is not important to me.
If you could use anything, what would you do?
Hashtables are the way to go.
They have average O(1) complexity for lookup, insertion, and deletion!
They tend to occupy more memory than trees but are much faster.
Since you are just working with 32-bit integers (you can of course convert an IPv4 address to a 32-bit integer), things will be amazingly simple and fast.
You could also just use a sorted array. Insertion and removal cost O(n), but lookup is O(log n), and in particular the memory is just 4 bytes per IP.
The implementation is very simple, perhaps too simple :D
Binary trees have O(log n) complexity for lookup, insertion, and deletion.
A simple binary tree would not be sufficient, however; you need an AVL tree or a red-black tree, which can be very annoying and complicated to implement.
AVL and RBT trees are able to balance themselves, and we need that because an unbalanced tree has a worst-case lookup complexity of O(n), which is the same as a simple linked list!
If instead of single, unique IPs you need to ban IP ranges, you probably need a Patricia trie, also called a radix tree; they were invented for word dictionaries and for IP dictionaries.
However, these trees can be slower if not well written/balanced.
Hashtables are always better for simple lookups! They are almost too fast to be true :)
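A minimal sketch of the hashtable approach in C++ (the addresses here are just documentation examples):

#include <cstdint>
#include <unordered_set>

// Pack an IPv4 address into a 32-bit key.
constexpr uint32_t ip(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return (uint32_t(a) << 24) | (uint32_t(b) << 16) | (uint32_t(c) << 8) | d;
}

int main() {
    std::unordered_set<uint32_t> blacklist = {ip(203, 0, 113, 7), ip(198, 51, 100, 23)};
    bool blocked = blacklist.count(ip(203, 0, 113, 7)) > 0;   // average O(1) lookup
    return blocked ? 0 : 1;
}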
Now about synchronization:
If you are filling the blacklist only once at application startup, you can use a plain read-only hashtable (or radix tree), which has no multithreading or locking problems.
If you only need to update it occasionally, I would suggest using reader-writer locks.
If you need very frequent updates, I would suggest using a concurrent hashtable.
Warning: don't write your own; they are very complicated and bug-prone. Find an implementation on the web!
They make heavy use of the (relatively) new atomic CAS operations of modern processors (CAS means compare-and-swap). These are special instructions, or sequences of instructions, that allow 32-bit or 64-bit fields in memory to be compared and swapped in a single atomic operation without the need for locking.
Using them can be complicated because you have to know your processor, your operating system, and your compiler very well, and the algorithm itself is counterintuitive.
See http://en.wikipedia.org/wiki/Compare-and-swap for more information about CAS.
Concurrent AVL trees have been invented, but they are so complicated that I really don't know what to say about them :) See, for example, http://hal.inria.fr/docs/00/07/39/31/PDF/RR-2761.pdf
I just found that concurrent radix trees exist:
ftp://82.96.64.7/pub/linux/kernel/people/npiggin/patches/lockless/2.6.16-rc5/radix-intro.pdf but they are quite complicated too.
Concurrent sorted arrays don't exist, of course; you need a reader-writer lock for updates.
Consider also that the amount of memory required for a non-concurrent hashtable can be quite small: for each IP you need 4 bytes for the address plus a pointer.
You also need a big array of buckets (pointers, or 32-bit integers with some tricks) whose size should be a prime number greater than the number of items to be stored.
Hashtables can of course also resize themselves when required, if you want, and they can also store more items than that prime size, at the cost of slower lookups.
For both trees and hashtable, the space complexity is linear.
I hope this is a multithreaded application and not a multiprocess application (fork).
If it is not multithreaded, you cannot share a portion of memory in a fast and reliable way.
One way to improve the performance of such a system is to use a Bloom Filter. This is a probabilistic data structure, taking up very little memory, in which false positives are possible but false negatives are not.
When you want to look up an IP address, you first check the Bloom filter. If there's a miss, you can allow the traffic right away. If there's a hit, you need to check your authoritative data structure (e.g., a hash table or prefix tree).
You could also create a small cache of "hits in the Bloom Filter but actually allowed" addresses, that is checked after the Bloom Filter but before the authoritative data structure.
Basically the idea is to speed up the fast path (IP address allowed) at the expense of the slow path (IP address denied).
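A minimal sketch of that fast-path check (C++; the bit-array size and the seeded hash mix are assumptions, not tuned values):

#include <bitset>
#include <cstddef>
#include <cstdint>

class IpBloomFilter {
    static constexpr std::size_t kBits = 1 << 20;   // ~128 KiB of bits, assumed budget
    std::bitset<kBits> bits_;

    static std::size_t slot(uint32_t ip, uint32_t seed) {
        // Simple seeded integer mix; any decent hash family would do here.
        uint64_t x = (uint64_t(seed) << 32) | ip;
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
        return static_cast<std::size_t>(x % kBits);
    }

public:
    void add(uint32_t ip) {
        for (uint32_t seed = 0; seed < 4; ++seed) bits_.set(slot(ip, seed));
    }
    // false => definitely not blocked (fast path); true => maybe blocked,
    // so fall through to the authoritative hash table or prefix tree.
    bool maybe_contains(uint32_t ip) const {
        for (uint32_t seed = 0; seed < 4; ++seed)
            if (!bits_.test(slot(ip, seed))) return false;
        return true;
    }
};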
The "most efficient" is a hard term to quantify. Clearly, if you had unlimited memory, you would have a bin for every IP address and could immediately index into it.
A common tradeoff is a B-tree-like data structure. First-level bins could be preallocated for the first 8 bits of the IP address, each storing a pointer to, and the size of, a list containing the currently blocked IP addresses that share that prefix. This second-level list would be padded to avoid unnecessary memmove() calls and possibly kept sorted. (Having the size and the length of the list in memory allows an in-place binary search on the list, at a slight expense in insertion time.)
For example:
127.0.0.1 =insert=> { 127 :: 1 }
127.0.1.0 =insert=> { 127 :: 1, 256 }
12.0.2.30 =insert=> { 12 : 542; 127 :: 1, 256 }
The overhead of such a data structure is minimal, and the total storage size is fixed. The worst case, clearly, would be a large number of IP addresses sharing the same highest-order bits.
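A small sketch of that scheme (C++; the structure and names are illustrative):

#include <algorithm>
#include <cstdint>
#include <vector>

// 256 bins keyed on the first octet, each holding a sorted list of the
// remaining 24 bits of the blocked addresses with that prefix.
struct PrefixBins {
    std::vector<uint32_t> bins[256];

    void insert(uint32_t ip) {
        auto& bin = bins[ip >> 24];
        uint32_t rest = ip & 0x00FFFFFF;
        bin.insert(std::lower_bound(bin.begin(), bin.end(), rest), rest);
    }
    bool is_blocked(uint32_t ip) const {
        const auto& bin = bins[ip >> 24];
        return std::binary_search(bin.begin(), bin.end(), ip & 0x00FFFFFF);
    }
};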
I want to sort on the order of four million long longs in C. Normally I would just malloc() a buffer to use as an array and call qsort() but four million * 8 bytes is one huge chunk of contiguous memory.
What's the easiest way to do this? I rate ease over pure speed for this. I'd prefer not to use any libraries and the result will need to run on a modest netbook under both Windows and Linux.
Just allocate a buffer and call qsort. 32MB isn't so very big these days even on a modest netbook.
If you really must split it up: sort smaller chunks, write them to files, and merge them (a merge takes a single linear pass over each of the things being merged). But, really, don't. Just sort it.
(There's a good discussion of the sort-and-merge approach in volume 2 of Knuth, where it's called "external sorting". When Knuth was writing that, the external data would have been on magnetic tape, but the principles aren't very different with discs: you still want your I/O to be as sequential as possible. The tradeoffs are a bit different with SSDs.)
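A minimal sketch of "just allocate a buffer and call qsort" (written as C-style C++ so it mirrors the malloc()/qsort() wording; the element count is the four million from the question):

#include <cstdlib>

// Comparator for long long values; (x > y) - (x < y) avoids overflow from x - y.
static int cmp_ll(const void* a, const void* b) {
    long long x = *static_cast<const long long*>(a);
    long long y = *static_cast<const long long*>(b);
    return (x > y) - (x < y);
}

int main() {
    const std::size_t n = 4000000;                 // ~32 MB in total
    long long* buf = static_cast<long long*>(std::malloc(n * sizeof *buf));
    if (!buf) return 1;
    // ... fill buf from the input ...
    std::qsort(buf, n, sizeof *buf, cmp_ll);
    std::free(buf);
    return 0;
}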
32 MB? That's not too big... quicksort should do the trick.
Your best option would be to avoid having the data unordered in the first place, if possible. As has been mentioned, you'd be better off reading the data from disk (or network, or whatever the source) directly into a self-organizing container (a tree; perhaps std::set will do).
That way, you'll never have to sort through the whole lot or worry about memory management. If you know the required capacity of the container, you might squeeze out additional performance by using std::vector(initialcapacity) or calling vector::reserve up front.
You'd then be best advised to use std::make_heap to heapify any existing elements, and then add element by element using push_heap (see also pop_heap). This is essentially the same paradigm as the self-ordering set, but
duplicates are ok
the storage is 'optimized' as a flat array (which is perfect for e.g. shared memory maps or memory mapped files)
(Oh, minor detail, note that sort_heap on the heap takes at most N log N comparisons, where N is the number of elements)
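A tiny sketch of the heap variant (C++; the sample values are made up):

#include <algorithm>
#include <vector>

int main() {
    std::vector<long long> values = {42, 7, 19, 3};   // assumed initial data
    std::make_heap(values.begin(), values.end());     // heapify existing elements

    values.push_back(11);                              // add a new element...
    std::push_heap(values.begin(), values.end());      // ...and restore the heap property

    std::sort_heap(values.begin(), values.end());      // ascending order, at most N log N comparisons
    return 0;
}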
Let me know if you think this is an interesting approach. I'd really need a bit more info on the use case.