how to find the most frequent number in 1T numbers? - c

How to find the most frequent number(int type) in 1T (i.e. 10^12) numbers?
My premises are:
My memory is limited to 4G (i.e. 4·10^9) bytes.
All the numbers are stored in a file as the input.
The output is just one number.
All numbers (int type) are stored in one or several files
The file structure is either binary or line-based.
Edited at : 2013.04.22 17:08
Thanks for your comments:
Plus:
- External Storage is not limited.

First note that the problem is at least as hard as the element distinctness problem.
Thus, the solutions should follow the same approaches:
sort (using external sort) and iterate while counting occurrences of each number, tracking the maximum.
Hashing solution: hash the numbers into buckets that fit in memory (note that all occurrences of the same number will be hashed to the same bucket); for each bucket, find the most frequent number and store it. Then go through all candidates from all buckets and choose the best.
Here, you can either sort (in memory) each bucket and find the most frequent number, or you can build a histogram (using a hash map with a different hash function) to find the frequency of each item in the bucket.
Note that the buckets are written to disk and loaded into memory one after the other, so at any time only a small part of the data is held in RAM.
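To make the bucketing concrete, here is a minimal sketch in C, assuming the input is one flat binary file of 32-bit ints. The file names, the bucket count, and the lack of error handling are all illustrative; the bucket count must be chosen so that each bucket actually fits in RAM.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NUM_BUCKETS 256   /* illustrative; pick it so each bucket fits in RAM */

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    FILE *in = fopen("numbers.bin", "rb");        /* hypothetical input file */
    FILE *bucket[NUM_BUCKETS];
    char name[64];
    uint32_t x;

    for (int i = 0; i < NUM_BUCKETS; i++) {
        snprintf(name, sizeof name, "bucket_%03d.bin", i);
        bucket[i] = fopen(name, "w+b");
    }

    /* Pass 1: every occurrence of a value lands in the same bucket file. */
    while (fread(&x, sizeof x, 1, in) == 1)
        fwrite(&x, sizeof x, 1, bucket[x % NUM_BUCKETS]);
    fclose(in);

    uint32_t best_value = 0;
    uint64_t best_count = 0;

    /* Pass 2: load one bucket at a time, sort it, count runs. */
    for (int i = 0; i < NUM_BUCKETS; i++) {
        fflush(bucket[i]);
        fseek(bucket[i], 0, SEEK_END);
        size_t n = (size_t)ftell(bucket[i]) / sizeof(uint32_t);
        rewind(bucket[i]);

        uint32_t *buf = malloc(n * sizeof(uint32_t));
        fread(buf, sizeof(uint32_t), n, bucket[i]);
        fclose(bucket[i]);

        qsort(buf, n, sizeof(uint32_t), cmp_u32);
        for (size_t j = 0; j < n; ) {
            size_t k = j;
            while (k < n && buf[k] == buf[j])
                k++;
            if (k - j > best_count) {
                best_count = k - j;
                best_value = buf[j];
            }
            j = k;
        }
        free(buf);
    }

    printf("most frequent: %u (seen %llu times)\n",
           best_value, (unsigned long long)best_count);
    return 0;
}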
Another, more scalable approach could be map-reduce, with a simple map-reduce step to count the number of occurrences per number, and then find the maximum of those:
map(number):
    emit(number, 1)
reduce(number, list):
    emit(number, size(list))
All that is left is to find the number with the highest value, which can be done with a linear scan.

What about using the filesystem to store counters for the numbers?
For example, if your numbers are uint32, you can create 65536 directories with 65536 files in each.
The directory name is the two high bytes of the number, the file name the two low bytes. When you meet a number X, you split it into the two parts to get the filename, open that file and increment the counter inside it (or write 1 there if the file is absent).
After filling that file structure, you can recursively scan the tree to find the file with the greatest value.
That would be very slow, but would use almost no RAM.
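A rough sketch of the per-number counter update, assuming a POSIX filesystem; the path layout follows the description above (high two bytes as the directory, low two bytes as the file), and error handling is omitted.

#include <stdio.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Increment the on-disk counter for a 32-bit value x. */
static void bump_counter(uint32_t x)
{
    char dir[16], path[32];
    unsigned long count = 0;
    FILE *f;

    snprintf(dir, sizeof dir, "%04x", x >> 16);              /* high two bytes */
    snprintf(path, sizeof path, "%s/%04x", dir, x & 0xFFFF); /* low two bytes  */
    mkdir(dir, 0755);                      /* no-op if it already exists */

    f = fopen(path, "r");                  /* read the current count, if any */
    if (f) {
        fscanf(f, "%lu", &count);
        fclose(f);
    }
    f = fopen(path, "w");                  /* write it back, incremented */
    fprintf(f, "%lu\n", count + 1);
    fclose(f);
}

The final recursive scan then just reads each counter back and keeps the largest.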

Use a hash table: the key is the number, the value is the count. O(n) to insert all the numbers into the hash table, O(unique numbers) to find the most frequent.
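As a sketch of this approach, here is a minimal open-addressing table in C, assuming the number of distinct values is small enough that a table of this capacity fits in memory; the capacity and hash constant are illustrative.

#include <stdint.h>

#define CAP (1u << 24)   /* illustrative; must stay well above the number of distinct keys */

static uint32_t keys[CAP];
static uint64_t counts[CAP];
static uint8_t  used[CAP];

/* Count one occurrence of key, using linear probing on collisions. */
static void add(uint32_t key)
{
    uint32_t h = (key * 2654435761u) & (CAP - 1);   /* multiplicative hash */
    while (used[h] && keys[h] != key)
        h = (h + 1) & (CAP - 1);
    used[h] = 1;
    keys[h] = key;
    counts[h]++;
}

After all insertions, a single linear scan over counts[] yields the most frequent number.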

Brute force:
remember = 0;
repeat:
    take the first unmarked number and count its occurrences in the file (n1).
    mark each occurrence of that number as read (overwrite it with blanks, for example).
    if (n1 > remember) { remember = n1; answer = that number; }

Related

Appropriate data structure for counting frequency of string

I have a task of counting the frequency of strings (words) in a text file. What data structure do you think is appropriate (based on implementation difficulty, memory usage and time complexity of the algorithm)? I have a hash table, binary search tree and heap in mind, but I don't know which one to choose. Also, if there is any better data structure than the ones I mentioned, that would be great too. Thanks in advance.
N.B. the text file could be extremely large.
Because you say the file could be extremely large, I assumed you can't keep all the words in memory simultaneously.
Note that if the file had all words sorted, finding the frequencies would require keeping only a counter and the last two words in memory at a time to compare them. As long as the same word as before is read, increment the counter. When you hit a different word, save the previous word and its count to another file holding the frequencies, and start counting over for the new word.
So the question is how to sort the words in a file. For that purpose, you can use merge sort. Note that when merging subarrays, you only need to keep two words in memory, one per subarray. Additionally, you will need to create an extra file, like the extra array in an in-memory merge sort, and play with positions in the files. If you write to the original and extra files alternately in the recursive calls, these two will be enough.
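A minimal sketch of the counting pass in C, assuming the words have already been externally sorted, one per line; the file names and the line-length limit are illustrative, and error handling is omitted.

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *in = fopen("words_sorted.txt", "r");   /* hypothetical sorted input */
    FILE *out = fopen("frequencies.txt", "w");   /* hypothetical output */
    char line[256], prev[256] = "";
    unsigned long count = 0;

    while (fgets(line, sizeof line, in)) {
        line[strcspn(line, "\n")] = '\0';        /* strip the newline */
        if (strcmp(line, prev) == 0) {
            count++;                             /* same word as before */
        } else {
            if (count > 0)
                fprintf(out, "%lu %s\n", count, prev);
            strcpy(prev, line);                  /* start counting the new word */
            count = 1;
        }
    }
    if (count > 0)
        fprintf(out, "%lu %s\n", count, prev);   /* flush the last run */
    fclose(in);
    fclose(out);
    return 0;
}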

How do I search for the most common words in a very big file (over 1 GB) using 1 KB or less of memory?

I have a very big text file, with dozens of millions of words, one word per line. I need to find the top 10 most common words in that file. There are some restrictions: only the standard library may be used, and no more than 1 KB of memory.
It is guaranteed that any 10 words in that file are short enough to fit within the stated memory limit, and that there will be enough memory for some other variables such as counters.
The only solution I have come up with is to use another text file as additional memory and a buffer. But that seems a bad and slow way to deal with the problem.
Are there any better, more efficient solutions?
You can first sort this file (it is possible with limited memory, but will of course require disk IO - see How do I sort very large files as a starting point).
Then you will be able to read the sorted file line by line and calculate the frequency of each word one by one. Keep an array of the 10 most frequent words seen so far: whenever a word's frequency is higher than one already stored, add it and drop the least frequent entry, so that only the 10 most frequent words are kept in memory during this stage.
As @John Bollinger mentioned, if the requirement is to print all top-10 words - if, for example, all words in the file have the same frequency, i.e. they are all "top" - then this approach will not work; you would need to calculate the frequency of each word, store the results in a file, sort that, and then print the top 10 including all words with the same frequency as the 10th one.
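A minimal sketch of that stage in C, assuming the file has already been sorted so that equal words are adjacent; the file name and word-length limit are illustrative, and whether the fixed buffers squeeze under 1 KB depends on the chosen limits.

#include <stdio.h>
#include <string.h>

#define TOP 10
#define MAXW 64

static char top_word[TOP][MAXW];
static unsigned long top_count[TOP];

/* Insert a (word, count) run into the top-10 arrays, kept sorted descending. */
static void record(const char *w, unsigned long c)
{
    for (int i = 0; i < TOP; i++) {
        if (c > top_count[i]) {
            for (int j = TOP - 1; j > i; j--) {      /* shift smaller entries down */
                top_count[j] = top_count[j - 1];
                strcpy(top_word[j], top_word[j - 1]);
            }
            top_count[i] = c;
            strcpy(top_word[i], w);
            return;
        }
    }
}

int main(void)
{
    FILE *in = fopen("words_sorted.txt", "r");   /* hypothetical sorted input */
    char word[MAXW], prev[MAXW] = "";
    unsigned long run = 0;

    while (fscanf(in, "%63s", word) == 1) {
        if (strcmp(word, prev) == 0) {
            run++;                                /* still in the same run */
        } else {
            if (run > 0)
                record(prev, run);
            strcpy(prev, word);
            run = 1;
        }
    }
    if (run > 0)
        record(prev, run);                        /* flush the last run */
    fclose(in);

    for (int i = 0; i < TOP && top_count[i] > 0; i++)
        printf("%lu %s\n", top_count[i], top_word[i]);
    return 0;
}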
If you can create a new file however big, you can create a simple disk-based tree database holding each word and its frequency so far. This will cost you O(log n) each time, with n from 1 to N words, plus the final scan of the whole N-sized tree, which adds up to O(N log N).
If you cannot create a new file, you'll need to perform an in-place sort of the whole file, which will cost about O(N^2). More precisely it's about O((N/k)^2), I think, with k the average number of words you can keep in memory for the simplest bubble sort; but that is (1/k^2)·O(N^2), which is still O(N^2). At that point you can rescan the file one final time, and after each run of a word you'll know whether that word can enter your top ten, and at which position. So you need to fit just twelve words in memory (the top ten, the current word, and the word just read from the file). 1 KB should be enough.
So, the auxiliary file is actually the fastest option.

Reading specific line in txt file with C

I am working with Mac OSX, programming in C and using bash in terminal.
I am currently trying to make a lookup table for the gamma function. I have been told that calling gsl_sf_gamma is pretty expensive and that a lookup table would be far faster. I did not wish to lose too much accuracy, so I wanted a fairly large lookup table. Initializing a huge array would not be ideal, since that defeats the purpose.
My thought was to make a large text file with the values of the gamma function pre-evaluated over the range of interest. A major problem with this is that I don't know how to read a specific line within a text file using C.
Thanks for any insight and help you guys can offer.
Warning: I know very little about strings and txt files, so I might just not know a simple function that does this already.
Gamma is basically factorial except in continuous form. You want to perform a lookup rather than a computation for the gamma function. You want to use a text file to represent these results. Each line of the file represents the input value multiplied by 1000. I guess for a high enough input value, the file scan could outperform doing the compute.
However, I think you will at minimum want to compute an index into your file. The file can still be arranged as a text file, but you have another step that scans the file, and notes the byte offset for each result line. These offsets get recorded into a binary file, which will serve as your index.
When you run your program, at the beginning you load the index file into an array, where the array index is the floor of the gamma input multiplied by 1000, and the array value at that index is the offset recorded in the index file. When you want to compute gamma for a particular number, you multiply the input by 1000 and truncate the result to obtain your array index. You consult this array for the offset, and the next array value to compute the length of the entry. Then your gamma text file is opened as a binary file; you seek to the offset and read that many bytes to get your digits. You will need to read the next entry too to perform your interpolation.
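A rough sketch of this scheme in C. The file name, the 0.001 step, and the assumption that line i holds the precomputed value of gamma(i/1000) are all illustrative; for brevity the index is built and kept in memory rather than saved to a separate binary file, and interpolation with the next entry is left out.

#include <stdio.h>
#include <stdlib.h>

static long *offsets;          /* offsets[i] = byte offset of line i in the table */
static size_t num_offsets;

/* Build the index once by scanning the text file and recording where each line starts. */
static void build_index(FILE *f)
{
    size_t cap = 1024;
    offsets = malloc(cap * sizeof *offsets);
    long pos = ftell(f);
    char line[128];
    while (fgets(line, sizeof line, f)) {
        if (num_offsets == cap)
            offsets = realloc(offsets, (cap *= 2) * sizeof *offsets);
        offsets[num_offsets++] = pos;
        pos = ftell(f);
    }
}

/* Look up gamma(x) by seeking to the line for floor(x*1000). */
static double gamma_lookup(FILE *f, double x)
{
    size_t i = (size_t)(x * 1000.0);
    double value = 0.0;
    if (i >= num_offsets)
        return 0.0;                     /* out of table range */
    fseek(f, offsets[i], SEEK_SET);
    fscanf(f, "%lf", &value);
    return value;
}

int main(void)
{
    FILE *f = fopen("gamma_table.txt", "r");   /* hypothetical table file */
    build_index(f);
    printf("gamma(2.500) ~= %f\n", gamma_lookup(f, 2.5));
    fclose(f);
    return 0;
}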
Yes, calculating gamma is slow (I think GSL uses the Lanczos formula, which sums a series). If the number of values for which you need to calculate it is limited (say, you're only doing integers), then a lookup table might certainly help. But if the table is too big for memory, it won't help; it will be even slower than the calculation.
If the table will fit into memory, there's nothing wrong with storing it in a file until you need it and then loading the whole thing into memory at once.

storing strings in an array in a compact way [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 3.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring, if so note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
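The key primitive for the greedy merging step is the length of the longest suffix of one string that is also a prefix of another. A naive quadratic version is sketched below; the radix-tree or suffix-tree approaches mentioned above are what make it fast.

#include <string.h>

/* Length of the longest suffix of a that is also a prefix of b,
 * e.g. overlap("ragdoll", "dollhouse") == 4. Naive O(len(a)*len(b)). */
static size_t overlap(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t max = la < lb ? la : lb;
    for (size_t k = max; k > 0; k--)
        if (strncmp(a + la - k, b, k) == 0)
            return k;
    return 0;
}

The greedy heuristic repeatedly picks the pair with the largest overlap and merges them (here, ragdoll + dollhouse gives ragdollhouse).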
I think you can use a Radix Tree. It costs some memory because of the pointers to leaves and parents, but matching strings is easy, at O(k) (where k is the length of the longest string).
My first thought here is: use a data structure to determine the common prefixes and suffixes of your strings. Then sort the words taking these prefixes and suffixes into account. This would result in your desired ragdollhouse.
Looks similar to the Knapsack problem, which is NP-complete, so there is no "definitive" algorithm.
I did a lab back in college where we were tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint: there are mathematical substitutions for getting the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
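Of the three steps, the move-to-front transform is the simplest to show; here is a minimal sketch over the byte alphabet (BWT and Huffman coding are considerably more involved).

#include <stddef.h>

/* Move-to-front transform: each input byte is replaced by its current
 * position in a list of all 256 byte values, and that value is then
 * moved to the front of the list. */
static void mtf_encode(const unsigned char *in, unsigned char *out, size_t n)
{
    unsigned char list[256];
    for (int i = 0; i < 256; i++)
        list[i] = (unsigned char)i;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];
        int pos = 0;
        while (list[pos] != c)
            pos++;
        out[i] = (unsigned char)pos;
        for (; pos > 0; pos--)          /* move c to the front of the list */
            list[pos] = list[pos - 1];
        list[0] = c;
    }
}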
Refine step 3.
Look through the current list and see whether any word in the list starts with a suffix of the current word. (You might want to keep the suffix longer than some length - longer than 1, for example.)
If yes, then prepend the distinct (non-overlapping) prefix of the current word to the existing word, and adjust all existing references appropriately (slow!)
If no, add word to end of list as in current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. An enormous amount of manpower has already gone into compression algorithms, so why not take one of the ones already available?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit better compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
It's not clear what you want to do.
Do you want a data structure that lets you store the strings in a memory-conscious manner while still making operations like search possible in a reasonable amount of time?
Do you just want an array of words, compressed?
In the first case, you can go for a Patricia trie or a String B-Tree.
For the second case, you can just adopt some index compression technique, like this:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress like that:
0aaa
3b
2sd
1baco
3d
The number is the length of the largest common prefix with the preceding string.
You can tweak that scheme, for example by planning a "restart" of the common prefix after every K words, for fast reconstruction.
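A minimal sketch of that encoding in C, assuming the words arrive already sorted as in the example; the restart-every-K-words tweak is left out.

#include <stdio.h>
#include <string.h>

/* Front coding: each word is written as the length of its common prefix
 * with the previous word, followed by the remaining suffix. */
static void front_code(const char *words[], size_t n, FILE *out)
{
    const char *prev = "";
    for (size_t i = 0; i < n; i++) {
        size_t k = 0;
        while (prev[k] && words[i][k] && prev[k] == words[i][k])
            k++;                                   /* common prefix length */
        fprintf(out, "%zu%s\n", k, words[i] + k);  /* e.g. "aaab" -> "3b" */
        prev = words[i];
    }
}

int main(void)
{
    const char *words[] = { "aaa", "aaab", "aasd", "abaco", "abad" };
    front_code(words, 5, stdout);   /* prints 0aaa, 3b, 2sd, 1baco, 3d */
    return 0;
}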

Fast string search using bitwise operators

What is the fastest (parallel?) way to find a substring in a very long string using bitwise operators?
e.g. find all positions of "GCAGCTGAAAACA" sequence in a human genome http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit (770MB)
*the alphabet consists of 4 symbols ('G','C','T','A') represented using 2 bits:
'G':00, 'A':01, 'T':10, 'C':11
*you can assume the query string (the shorter one) is fixed in length, e.g. 127 characters
*by fastest I mean not including any pre-processing/indexing time
*the file is going to be loaded into memory after pre-processing, basically there will be billions of short strings to be searched for in a larger string, all in-memory.
*bitwise because I'm looking for the simplest, fastest way to search for a bit pattern in a large bit array and stay as close as possible to the silicon.
*KMP wouldn't work well as the alphabet is small
*C code, x86 machine code would all be interesting.
Input format description (.2bit): http://jcomeau.freeshell.org/www/genome/2bitformat.html
Related:
Fastest way to scan for bit pattern in a stream of bits
Algorithm help! Fast algorithm in searching for a string with its partner
http://www.arstdesign.com/articles/fastsearch.html
http://en.wikipedia.org/wiki/Bitap_algorithm
If you're just looking through a file, you're pretty much guaranteed to be IO-bound. Use of a large buffer (~16K) and strstr() should be all you need. If the file is encoded in ASCII, search just for "gcagctgaaaaca". If it actually is encoded in bits, just permute the possible accepted strings (there should be ~8; lop off the first byte), and use memmem() plus a tiny overlapping bit check.
I'll note here that glibc strstr and memmem already use a linear-time search algorithm, so test that performance. It may surprise you.
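A minimal sketch of the buffered scan for the ASCII case, carrying a pattern-length tail between reads so matches that straddle a buffer boundary are not missed; the file name and buffer size are illustrative, and error handling is omitted.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *pat = "gcagctgaaaaca";
    size_t plen = strlen(pat);
    char buf[16384 + 64];
    size_t keep = 0;
    long base = 0;                       /* file offset of buf[0] */
    FILE *f = fopen("genome.txt", "r");  /* hypothetical ASCII input */

    size_t got;
    while ((got = fread(buf + keep, 1, 16384, f)) > 0) {
        size_t len = keep + got;
        buf[len] = '\0';
        char *p = buf;
        while ((p = strstr(p, pat)) != NULL) {
            printf("match at offset %ld\n", base + (long)(p - buf));
            p++;                          /* allow overlapping matches */
        }
        /* carry over the tail so boundary-spanning matches are found */
        keep = plen - 1;
        if (len < keep)
            keep = len;
        memmove(buf, buf + len - keep, keep);
        base += (long)(len - keep);
    }
    fclose(f);
    return 0;
}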
If you first encode/compress the DNA string with a lossless coding method (e.g. Huffman, exponential Golomb, etc.) then you get a ranked probability table ("coding tree") for DNA tokens of various combinations of nucleotides (e.g., A, AA, CA, etc.).
What this means is that, once you compress your DNA:
You'll probably be using fewer bits to store GCAGCTGAAAACA and other subsequences, than the "unencoded" approach of always using two bits per base.
You can walk through the coding tree or table to build an encoded search string, which will usually be shorter than the unencoded search string.
You can apply the same family of exact search algorithms (e.g. Boyer-Moore) to locate this shorter, encoded search string.
As for a parallelized approach, split the encoded target string up into N chunks and run the search algorithm on each chunk, using the shortened, encoded search string. By keeping track of the bit offsets of each chunk, you should be able to generate match positions.
Overall, this compression approach would be useful if you plan on doing millions of searches on sequence data that won't change. You'd be searching fewer bits — potentially many fewer, in aggregate.
Boyer-Moore is a technique used to search for substrings in plain strings. The basic idea is that if your substring is, say, 10 characters long, you can look at the character at position 9 in the string to search. If that character is not part of your search string, you can simply start the search after that character. (If that character is, indeed, in your string, the Boyer-Moore algorithm uses a look-up table to skip the optimal number of characters forward.)
It might be possible to reuse this idea for your packed representation of the genome string. After all, there are only 256 different bytes, so you could safely pre-calculate the skip-table.
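A sketch of the Horspool variant of Boyer-Moore over raw bytes, with the 256-entry skip table mentioned above. On the packed 2-bit data this only finds matches that start at a byte boundary, so in practice it would be run once per 2-bit shift of the pattern (four shifted copies), as the next answer also suggests.

#include <stddef.h>
#include <string.h>

/* Return the offset of the first match of pat[0..m) in text[0..n), or -1. */
static long bmh_search(const unsigned char *text, size_t n,
                       const unsigned char *pat, size_t m)
{
    size_t skip[256];
    if (m == 0 || n < m)
        return -1;

    for (size_t i = 0; i < 256; i++)
        skip[i] = m;                       /* default: jump a full pattern length */
    for (size_t i = 0; i + 1 < m; i++)
        skip[pat[i]] = m - 1 - i;          /* distance from the last position */

    for (size_t pos = 0; pos + m <= n; pos += skip[text[pos + m - 1]]) {
        if (memcmp(text + pos, pat, m) == 0)
            return (long)pos;
    }
    return -1;
}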
The benefit of encoding the alphabet into bit fields is compactness: one byte holds the equivalent of four characters. This is similar to some of the optimizations Google performs searching for words.
This suggests four parallel executions, each with the (transformed) search string offset by one character (two bits). A quick-and-dirty approach might be to just look for the first or second byte of the search string and then check extra bytes before and after matching the rest of the string, masking off the ends if necessary. The first search is handily done by the x86 instruction scasb. Subsequent byte matches can build upon the register values with cmpb.
You could create a state machine. In this topic,
Fast algorithm to extract thousands of simple patterns out of large amounts of text
, I used [f]lex to create the state machine for me. It would require some hackery to use the 4-letter (i.e. two-bit) alphabet, but it can be done using the same tables as generated by [f]lex. (You could even create your own fgetc()-like function which extracts two bits at a time from the input stream and keeps the other six bits for consecutive calls. Pushback will be a bit harder, but not undoable.)
BTW: I seriously doubt if there is any gain in compressing the data to two bits per nucleotide, but that is a different matter.
Okay, given your parameters, the problem isn't that hard, just not one you'd approach like a traditional string search problem. It more resembles a database table-join problem, where the tables are much larger than RAM.
select a good rolling hash function aka buzhash. If you have billions of strings, you're looking for a hash with 64-bit values.
create a hash table based on each 127-element search string. The table in memory only needs to store (hash,string-id), not the whole strings.
scan your very large target string, computing the rolling hash and looking up each value of the hash in your table. Whenever there's a match, write the (string-id, target-offset) pair to a stream, possibly a file.
reread your target string and the pair stream, loading search strings as needed to compare them against the target at each offset.
I am assuming that loading all pattern strings into memory at once is prohibitive. There are ways to segment the hash table into something that is larger than RAM but not a traditional random-access hash file; if you're interested, search for "hybrid hash" and "grace hash", which are more common in the database world.
I don't know if it's worth your while, but your pair stream gives you the perfect predictive input to manage a cache of pattern strings -- Belady's classic VM page replacement algorithm.
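A rough sketch of step 3's rolling scan in C, using a simple polynomial rolling hash as a stand-in for the buzhash suggested above; for brevity it compares against a single precomputed pattern hash, whereas the full scheme would look each window hash up in the (hash, string-id) table.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define BASE 1099511628211ULL      /* arbitrary odd multiplier; wraps mod 2^64 */

/* Hash of a fixed-length block: h = p[0]*B^(m-1) + ... + p[m-1]. */
static uint64_t hash_block(const unsigned char *p, size_t m)
{
    uint64_t h = 0;
    for (size_t i = 0; i < m; i++)
        h = h * BASE + p[i];
    return h;
}

/* Slide an m-byte window over text, reporting offsets where the window
 * hash (verified with memcmp) matches the pattern hash. */
static void scan(const unsigned char *text, size_t n,
                 const unsigned char *pat, size_t m)
{
    if (m == 0 || n < m)
        return;

    uint64_t target = hash_block(pat, m);
    uint64_t h = hash_block(text, m);
    uint64_t pow = 1;                     /* BASE^(m-1), to drop the oldest byte */
    for (size_t i = 1; i < m; i++)
        pow *= BASE;

    for (size_t pos = 0; ; pos++) {
        if (h == target && memcmp(text + pos, pat, m) == 0)
            printf("match at offset %zu\n", pos);
        if (pos + m >= n)
            break;
        h = (h - text[pos] * pow) * BASE + text[pos + m];   /* roll the window forward */
    }
}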
