Fast indexing for slow CPU?

I have a large document that I want to build an index of for word searching. (I hear this type of array is really called a concordance.) Currently it takes about 10 minutes. Is there a fast way to do it? Currently I iterate through each paragraph, and if I find a word I have not encountered before, I add it to my word array, along with the paragraph number in a subsidiary array. Any time I encounter that same word again, I add the paragraph number to the index.
This takes forever, well, 5 minutes or so. I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway.
Is there a faster way to build a text index other than linear brute force? I'm not looking for a product that will do the index for me, just the fastest known algorithm. The index should be accurate, not fuzzy, and there will be no need for partial searches.

I think the best idea is to build a trie, adding one word of your text at a time, and keeping at the node that ends each word a list of the locations where that word occurs.
This can save you some space, since words with a shared prefix store that prefix only once, and the search will be fast too. Search time is O(M), where M is the maximum string length, and insert time is O(n), where n is the length of the key you are inserting.
Since the obvious alternative is a hash table, here you can find some more comparison between the two.
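As a rough illustration of the trie idea, here is a minimal Python sketch. The names TrieNode, trie_insert and trie_lookup are made up for this example, and splitting words on whitespace is an assumption about what counts as a word.

```python
class TrieNode:
    __slots__ = ("children", "paragraphs")

    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.paragraphs = []  # paragraph numbers for the word ending at this node


def trie_insert(root, word, para_num):
    """Walk or create one node per character, then record the paragraph number."""
    node = root
    for ch in word:
        nxt = node.children.get(ch)
        if nxt is None:
            nxt = node.children[ch] = TrieNode()
        node = nxt
    # Paragraphs are scanned in order, so checking the last entry avoids duplicates.
    if not node.paragraphs or node.paragraphs[-1] != para_num:
        node.paragraphs.append(para_num)


def trie_lookup(root, word):
    """Return the paragraph numbers for an exact word, or [] if it is absent."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.paragraphs


root = TrieNode()
for para_num, text in enumerate(["the cat sat", "the dog sat", "a cat"]):
    for word in text.split():
        trie_insert(root, word, para_num)
```

A prefix such as "ca" reaches a node but that node has an empty paragraph list, so the lookup stays exact rather than partial.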

I would use a HashMap<String, List<Occurrency>>. This way you can check whether a word is already in your index in about O(1).
At the end, when you have collected all the words and want to search them very often, you might try to find a hash function that has no or nearly no collisions. This way you can guarantee O(1) time for the search (or nearly O(1) if you still have some collisions).

Well, apart from going along with MrSmith42's suggestion of using the built in HashMap, I also wonder how much time you are spending tracking the paragraph number?
Would it be faster to change things to track line numbers instead? (Especially if you are reading the input line-by-line).

There are a few things unclear in your question, like what you mean by "I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway." What array? Is your input in the form of an array of paragraphs, or do you mean the concordance entries per word, or what?
It is also unclear why your program is so slow; probably there is something inefficient in there. I suspect it is your check "if I find a word I have not encountered before". I presume you look up the word in the dictionary and then iterate through the array of occurrences to see whether the paragraph number is there? That's a slow linear search. You will be better served by a set there (think hash/dictionary where you care only about the keys), kind of
concord = {
    'chocolate': {10:1, 30:1, 35:1, 200:1, 50001:1},
    'parsnips': {5:1, 500:1, 100403:1}
}
and your check then becomes if paraNum in concord[word]: ... instead of a loop or binary search.
PS. Actually, assuming you keep the list of occurrences in an array AND scan the text from the first to the last paragraph, the arrays will end up sorted, so you only need to check the very last element: if word in concord and paraNum == concord[word][-1]:. (Examples are in pseudocode/Python, but you can translate to your language.)
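Putting the dictionary lookup and the last-element check together, a minimal Python sketch could look like the following. The function name build_concordance and the regular expression used to split words are assumptions made for this example, not part of the original question.

```python
import re
from collections import defaultdict

# Assumed definition of a "word": runs of letters and apostrophes, lowercased.
WORD_RE = re.compile(r"[a-z']+")


def build_concordance(paragraphs):
    """Map each word to the sorted list of paragraph numbers containing it."""
    concord = defaultdict(list)
    for para_num, text in enumerate(paragraphs):
        for word in WORD_RE.findall(text.lower()):
            hits = concord[word]  # O(1) average dictionary lookup
            # Paragraphs arrive in order, so only the last entry can be a duplicate.
            if not hits or hits[-1] != para_num:
                hits.append(para_num)
    return concord


paragraphs = [
    "Chocolate and parsnips.",
    "More chocolate, more chocolate.",
    "Just parsnips",
]
concord = build_concordance(paragraphs)
```

Each word costs one hash lookup and at most one append, so the whole build is a single pass over the text.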


How to store vocabulary in an array more effectively?

I've got a vocabulary, a, abandon, ... , z.
For some reason, I will use array rather than Trie to store them.
Thus a simple method can be: wordA\0wordB\0wordC\0...word\0
But there are some more economic methods for memory I think.
Since like is a substring of likely, we can store only the first position and length of like instead of the string itself. Thus we generate a "large string" which contains every word in the vocabulary and use position[i] and length[i] to get the i-th word.
For example, vocabulary contains three words ab, cd and bc.
I construct abcd as the "large string".
position[0] = 0, length[0] = 2
position[1] = 2, length[1] = 2
position[2] = 1, length[2] = 2
So how to generate the "large string" is the key to this problem, are there any cool suggestions?
I think the problem is similar to the TSP (Traveling Salesman Problem), which is NP-hard.
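Building the shortest such string is usually called the shortest common superstring problem. A simple greedy heuristic (not guaranteed to be optimal) repeatedly merges the two pieces with the largest overlap. The sketch below is one way to write it in Python; the function names are invented for the example, and its quadratic pair search would be far too slow for a full vocabulary.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(words):
    """Greedy heuristic: keep merging the pair of pieces with the largest overlap."""
    parts = []
    # Longest first, so words contained in an already kept word can be dropped.
    for w in sorted(dict.fromkeys(words), key=len, reverse=True):
        if not any(w in p for p in parts):
            parts.append(w)
    while len(parts) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left piece, index of right piece)
        for i, a in enumerate(parts):
            for j, b in enumerate(parts):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = parts[i] + parts[j][k:]
        parts = [p for idx, p in enumerate(parts) if idx not in (i, j)] + [merged]
    return parts[0] if parts else ""


def pack(words):
    """Return the large string plus a (position, length) pair for each word."""
    big = greedy_superstring(words)
    return big, [(big.index(w), len(w)) for w in words]


big, spans = pack(["ab", "cd", "bc"])
```

For the three-word example above this reproduces the string abcd with positions 0, 2 and 1.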
The search keyword you're looking for is "dictionary". i.e. data structures that can be used to store a list of words, and test other strings for being present or absent in the dictionary.
Your idea is more compact than storing every word separately, but far less compact than a good data structure like a DAWG. As you note, it isn't obvious how to optimally choose how to overlap your strings. What you're doing is a bit like what a lossless compression scheme (like gzip) would do. If you don't need to check words against your compact dictionary, maybe just use gzip or LZMA to compress a sorted word list. Let their algorithms find the redundancy and represent it compactly.
I looked into dictionaries for a recent SO answer that caught my interest: Memory-constrained external sorting of strings, with duplicates combined&counted, on a critical server (billions of filenames)
For a dictionary that doesn't have to have new words added on the fly, a Directed Acyclic Word Graph is the way to go. You match a string against it by following graph nodes until you either hit a point where there's no edge matching the next character, or you get to the end of your input string and find that the node in the DAWG is marked as being a valid end-of-word. (Rather than simply a substring that is only a prefix of some words). There are algorithms for building these state-machines from a simple array-of-words dictionary in reasonable time.
Your method can only take advantage of redundancy when a whole word is a substring of another word, or end-of-one, start-of-another. A DAWG can take advantage of common substrings everywhere, and is also quite fast to match words against. Probably comparable speed to binary-searching your data structure, esp. if the giant string is too big to fit in the cache. (Once you start exceeding cache size, compactness of the data structure starts to outweigh code complexity for speed.)
Less complex but still efficient is a Trie (or Radix Trie), where common prefixes are merged, but common substrings later in words don't converge again.
If you don't need to modify your DAWG or Trie at all, you can store it efficiently in a single block of memory, rather than dynamically allocating each node. You didn't say why you didn't want to use a Trie, and also didn't acknowledge the existence of the other data structures that do this job better than a plain Trie.

how to write order preserving minimal perfect hash for integer keys?

I have searched Stack Overflow and Google and can't find exactly what I'm looking for, which is this:
I have a set of 4-byte unsigned integer keys, up to a million or so, that I need to use as an index into a table. The easiest would be to simply use the keys as an array index, but I don't want to have a 4 GB array when I'm only going to use a couple of million entries! The table entries and keys are sequential, so I need a hash function that preserves order.
keys = {56, 69, 3493, 49956, 345678, 345679,....etc}
I want to translate the keys into {0, 1, 2, 3, 4, 5,....etc}
The keys could potentially be any integer but there wont be more than 2 million in total. The number will vary as keys (and corresponding array entries) will be deleted but new keys will always be higher numbered than the previous highest numbered key.
In the above example, if key 69 was deleted, then the hash integer returned on hashing 3493 should be 1 (rather than 2) as it then becomes the 2nd lowest number.
I hope I'm explaining this right. Is the above possible with any fast, efficient hashing solution? I need the translation to take in the low hundreds of nanoseconds, though I expect deletion to take longer. I looked at CMPH but couldn't find any usage examples that didn't involve getting the data from a file. It needs to run under Linux and be compiled with gcc using pure C.
Actually, I don't know if I understand what exactly you want to do.
It seems you are trying to obtain the index number in the "array" (or "list") of sequentially ordered integers that you have stored somewhere.
If you have stored these integer values in an array, then the algorithm that returns the index integer in optimal time is Binary Search.
Binary Search Algorithm
Since your list is known to be in order, then binary search works in O(log(N)) time, which is very fast.
If you delete an element from the list of "keys", the Binary Search Algorithm still works, without extra effort or space (however, removing one element from the list naturally forces you to move all the elements to the right of the deleted element).
You only have to provide three pieces of data to the Binary Search Algorithm: the array, the size of the array, and the desired key, of course.
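To make the idea concrete, here is a small sketch using the keys from the question. It is written in Python for brevity even though the question asks for C; the same logic can be written as a hand-rolled binary search loop over a sorted array. The helper name rank_of is invented for this example.

```python
from bisect import bisect_left


def rank_of(keys, key):
    """Return the position of key in the sorted list keys, or -1 if it is absent."""
    i = bisect_left(keys, key)  # O(log N) binary search
    if i < len(keys) and keys[i] == key:
        return i
    return -1


keys = [56, 69, 3493, 49956, 345678, 345679]
before = rank_of(keys, 3493)  # position while 69 is still present

keys.remove(69)               # O(N): shifts every later element left
after = rank_of(keys, 3493)   # position once 69 has been deleted

keys.append(400000)           # new keys are always larger, so appending keeps order
```

Deleting key 69 moves 3493 from position 2 to position 1, which is the order-preserving behaviour the question describes.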
There is a full Python implementation here. See also the materials available here. If you only need to decode the dictionary, the simplest way to go is to modify the Python code to make it spit out a C file defining the necessary array, and reimplement only the lookup function.
It could be solved by using two dynamic allocated arrays: One for the "keys" and one for the data for the keys.
To get the data for a specific key, you first find it in the key-array, and its index in the key-array is the index into the data array.
When you remove a key-data pair, or want to insert a new item, you reallocate the arrays, and copy over the keys/data to the correct places.
I don't claim this to be the best or most effective solution, but it is one solution to your problem anyway.
You don't need an order-preserving minimal perfect hash, because any old hash would do. You don't want to use a 4 GB array, but with 2 million items, you wouldn't mind using 3 million lookup entries.
A standard implementation of a hash map will do the job. It will allow you to delete and add entries and assign any value to entries as you add them.
This leaves you with the question "What hash function might I use on integers?" The usual answer is to take the remainder when dividing by a prime. The prime is chosen to be a bit larger than your expected number of items. For example, if you expect 2M items, then choose a prime around 3M.
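A minimal sketch of that approach, in Python for brevity: the table size is the first prime at or above 1.5 times the expected item count (mirroring the 2M to 3M example), and collisions are handled by chaining. The class name IntHashMap and the helper next_prime are hypothetical, and note that this table does not preserve key order.

```python
def next_prime(n):
    """Smallest prime >= n, found by trial division (fine for a few million)."""
    def is_prime(m):
        if m < 2:
            return False
        i = 2
        while i * i <= m:
            if m % i == 0:
                return False
            i += 1
        return True

    while not is_prime(n):
        n += 1
    return n


class IntHashMap:
    """Chained hash table for integer keys, using key % prime as the hash."""

    def __init__(self, expected_items):
        self.size = next_prime(expected_items * 3 // 2)
        self.buckets = [[] for _ in range(self.size)]

    def put(self, key, value):
        bucket = self.buckets[key % self.size]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self.buckets[key % self.size]:
            if k == key:
                return v
        return default

    def delete(self, key):
        bucket = self.buckets[key % self.size]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False


table = IntHashMap(expected_items=6)
for position, key in enumerate([56, 69, 3493, 49956, 345678, 345679]):
    table.put(key, position)
```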

Partial heap sorting to find k most frequent words in 5GB file

I know what algorithm I'd like to use but want to know what I'd have to change since the file is so large.
I want to use a hash to store the frequencies of the words and use a min-heap to store the most frequent words, adjusting the min-heap accordingly as I loop through the words. This should take O(n log k), I think. How will my algorithm need to change if I have too much data to store in memory? This is an issue I have trouble understanding in general, not just for this specific question, but I'm giving context so that it might help with the explanation.
I think there is no deterministic way to do that without having the entire file in memory (or making some expensive kind of merge sort).
But there are some good probabilistic algorithms. Take a look at the Count-Min Sketch.
There is a great implementation of this and other algorithms, in this library.
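For readers unfamiliar with the structure, here is a minimal Count-Min Sketch in Python. It is only an illustration, not the library mentioned above: the class name, the default width and depth, and the use of salted BLAKE2b as the family of hash functions are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator: it may overestimate but never underestimates."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        data = item.encode("utf-8")
        for row in range(self.depth):
            # A different salt per row gives one independent-looking hash per row.
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([row])).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Every row counts the item plus any colliding items; the minimum is closest.
        return min(self.rows[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for word in "the cat and the dog and the bird".split():
    sketch.add(word)
```

Memory use is width times depth counters regardless of how many distinct words appear, which is what makes it attractive for a file that does not fit in memory.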
Explaining the merge sort thing: if your file is already sorted, you can find the k most frequent pretty easily with a min-heap. Yes, a min-heap, to be able to discard the least frequent term when you find one more competitive. You can do this because you can know the frequency of the current word without having to read the entire file. If your file is unsorted, you must keep an entire list, because the most frequent term may appear anywhere in the file and be discarded as "non-competitive" too soon.
You can do a merge sort with limited memory pretty easily, but it's an I/O-intensive operation and may take a while. Actually, you can use any kind of External Sort.
Added after your comment that you need to calculate the frequencies.
You don't say how many words you expect are in the data, or what constitutes a word. If it's English text, I'd be surprised to see a half million words. And there certainly won't be a billion words in 5 gigabytes of text. But the technique doesn't really change, regardless of how many words there are.
You start by building a dictionary or hash map that contains key value pairs: word, count. As you read each word, look it up in the dictionary. If it's there, increase its count. If it's not there, add it with a count of 1.
If you have a lot of memory or relatively few words, it'll all fit into memory. If so, you can do the heap thing that I describe below.
If your memory fills up, then you simply write the key value pairs out to a text file, one word per line, like this:
word1, count
word2, count
Then clear your dictionary and keep going, adding words or increasing their counts. Repeat as necessary for each block of words until you've reached the end of the input.
Now you have a huge text file that contains word/count pairs. Sort it by word. There are many external sorting tools that'll do that. Two that come to mind are the Windows SORT utility and the GNU sort. Both can easily sort a very large file of short lines.
Once the file is sorted by word, you'll have:
word1, count
word1, count
word2, count
word3, count
word3, count
word3, count
Now it's a simple matter of going sequentially through the file, accumulating counts for words. At each word break, check its count against the heap as described below.
This whole process takes some time, but it works quite well. You can speed it some by sorting the blocks of words and writing them to individual files. Then, when you've reached the end of input you do an N-way merge on the several blocks. That's faster, but forces you to write a merge program, unless you can find one. If I were doing this once, I'd go for the simple solution. Were I to be doing it often, I'd spend the time to write a custom merge program.
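The sequential accumulation pass over the sorted file can be sketched in a few lines of Python. The function name merged_counts is invented here, and the "word, count" line format is the one shown above.

```python
from itertools import groupby


def merged_counts(sorted_lines):
    """Yield (word, total) from lines of 'word, count' already sorted by word."""
    # Split on the last comma so the count is always the final field.
    pairs = (line.rsplit(",", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


lines = [
    "word1, 4\n",
    "word1, 2\n",
    "word2, 7\n",
    "word3, 1\n",
    "word3, 1\n",
    "word3, 5\n",
]
totals = list(merged_counts(lines))
```

Because groupby only looks at consecutive lines, it holds one word in memory at a time; passing an open file object instead of a list works the same way.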
After you've computed the frequencies ...
Assuming your file contains the words and their frequencies, and all you want to do is get the k words with the highest frequencies, then yes it's O(n log k), and you don't have to store all of the items in memory. Your heap only requires k items.
The idea:
heap = new minheap();
for each item
{
    // if you don't already have k items on the heap, add this one
    if (heap.count < k)
        heap.Add(item);
    else if (item.frequency > heap.Peek().frequency)
    {
        // The new item's frequency is greater than the lowest frequency
        // already on the heap. Remove the item from the heap
        // and add the new item.
        heap.RemoveRoot();
        heap.Add(item);
    }
}
After you've processed every item, the heap will contain the k items that have the highest frequencies.
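The same idea as a runnable Python sketch, using the standard heapq module. The function name top_k and the (word, frequency) input format are assumptions for this example.

```python
import heapq


def top_k(items, k):
    """Return the k (frequency, word) pairs with the highest frequency.

    items is any iterable of (word, frequency) pairs; only k of them
    are held in memory at once.
    """
    if k <= 0:
        return []
    heap = []  # min-heap, so heap[0] is the weakest of the current top k
    for word, freq in items:
        if len(heap) < k:
            heapq.heappush(heap, (freq, word))
        elif freq > heap[0][0]:
            # Pop the smallest entry and push the new one in a single step.
            heapq.heapreplace(heap, (freq, word))
    return sorted(heap, reverse=True)


result = top_k([("a", 3), ("b", 10), ("c", 1), ("d", 7)], k=2)
```

Each item costs at most O(log k), giving the O(n log k) total mentioned above.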
You can use a selection algorithm to calculate the kth largest number. Then do a linear scan and select only the k largest numbers.
In practice you might want to start with an estimated range that the kth max falls into and continue from there. E.g. read the first M numbers and calculate the estimated kth max = (k*M/N)th max in those M numbers. If you think the data is biased (i.e. partially sorted), then pick those M numbers randomly.

How to search a big array for an object?

I had an interview today. I was asked how to search for a number inside an array, and I said binary search. He asked me about a big array that has thousands of objects (for example Stocks), searching for example by the price of the stocks. I said binary search again. He said sorting an array of thousands will take a lot of time before applying binary search.
Can you please bear with me and teach me how to approach this problem ?
your help is appreciated.
I was asked a similar question. The twist was to search in a sorted and then an unsorted array. These were my answers, all unaccepted:
For sorted, I suggested we can find the center and do a linear search. Binary search will also work here.
For unsorted, I suggested linear again.
Then I suggested binary, which is kind of wrong.
Suggested storing the array in a hash set and utilizing hashing. (Not accepted because of high space complexity.)
I suggested a TreeSet, which is a red-black tree and quite good for lookup. (Not accepted because of high space complexity.)
Copying into an ArrayList etc. was also considered overhead.
In the end I got a negative feedback.
Though we may think that one of the above is solution but surely there is something special in linear searching which I am missing.
To be noted sorting before searching is also an overhead especially if you are utilizing any extra data structures in between.
Any comments welcomed.
I am not sure what he had in mind.
If you just want to find the number one time, and you have no guarantees about whether the array is sorted, then I don't think you can beat linear search. On average you will need to seek halfway through the array before you find the value, i.e. expected running time O(N); when sorting you have to touch every single value at least once and probably more than that, i.e. expected running time O(N log N).
But if you need to find multiple values then the time spent sorting it pays off quickly. With a sorted array, you can binary search in O(log N) time, so for sure by the third search you are ahead if you invested the time to sort.
You can do even better if you are allowed to build different data structures to help with the problem. You could build some sort of index, such as a hash table; but the champion data structure for this sort of problem probably would be some sort of tree structure. Then you can insert new values into the tree faster than you could append new values and re-sort the array, and the lookup is still going to be O(log N) to find any value. There are different sorts of trees available: binary tree, B-tree, trie, etc.
But as @Hot Licks said, a hash table is often used for this sort of thing, and it's pretty cheap to update: you just append a value to the main array, and update the hash table to point to the new value. And a hash table is very close to O(1) time, which you can't beat. (A hash table is O(1) if there are no hash collisions; assuming a good hash algorithm and a big enough hash table there will be almost no collisions. I think you could say that a hash table is O(N) where N is the average number of hash collisions per "bucket". If I'm wrong about that I expect to be corrected very quickly; this is StackOverflow!)
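A short Python sketch of that arrangement, with the main array left unsorted and a dictionary acting as the hash index. The field names symbol and price_cents, and the helpers build_price_index and add_stock, are invented for the example; prices are kept as integer cents so that they hash and compare exactly.

```python
from collections import defaultdict


def build_price_index(stocks):
    """One O(N) pass: map each price to the positions in stocks that have it."""
    index = defaultdict(list)
    for pos, stock in enumerate(stocks):
        index[stock["price_cents"]].append(pos)
    return index


def add_stock(stocks, index, stock):
    """Append to the main array and point the index at the new position."""
    stocks.append(stock)
    index[stock["price_cents"]].append(len(stocks) - 1)


stocks = [
    {"symbol": "AAA", "price_cents": 1250},
    {"symbol": "BBB", "price_cents": 9900},
    {"symbol": "CCC", "price_cents": 1250},
]
index = build_price_index(stocks)
add_stock(stocks, index, {"symbol": "DDD", "price_cents": 700})

# Average O(1) lookup: every stock priced at 12.50
matches = [stocks[pos]["symbol"] for pos in index.get(1250, [])]
```

The price paid is the extra memory for the index, which is the space objection raised in the interview story above.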
I think the interviewer wants you to analyze which algorithm you would use for each possible initial state of the array. Of course, you must know that you can build a hash table and then find the number in O(1), or, when the array is sorted (the time spent on sorting may be a concern), you can use binary search, or use some other data structure to finish the job.

How to design a hashfunction that is scalable to exactly n elements?

I have a list of n strings (names of people) that I want to store in a hash table or similar structure. I know the exact value of n, so I want to use that fact to have O(1) lookups, which would be rendered impossible if I had to use a linked list to store my hash nodes. My first reaction was to use the djb hash, which essentially does this:
for ( i = 0; i < len; i++ )
    h = 33 * h + p[i];
To compress the resulting h into the range [0,n), I would like to simply do h%n, but I suspect that this will lead to a much higher probability of clashes in a way that would essentially render my hash useless.
My question then, is how can I hash either the string or the resulting hash so that the n elements provide a relatively uniform distribution over [0,n]?
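One way to check that suspicion empirically is to hash the actual names and look at how full each bucket gets. The Python sketch below assumes the common djb2 variant with a starting value of 5381 and 32-bit wraparound (the snippet above does not show the initial value of h), and the helper names and sample names are made up for the example.

```python
def djb2(s, bits=32):
    """djb2-style hash: h = 33 * h + byte, truncated to the given bit width."""
    h = 5381  # assumed starting value; the snippet above leaves it unspecified
    mask = (1 << bits) - 1
    for byte in s.encode("utf-8"):
        h = (33 * h + byte) & mask
    return h


def bucket_sizes(names, n):
    """Count how many names land in each of the n buckets under h % n."""
    sizes = [0] * n
    for name in names:
        sizes[djb2(name) % n] += 1
    return sizes


names = ["alice", "bob", "carol", "dave", "erin", "frank", "grace", "heidi"]
sizes = bucket_sizes(names, len(names))
longest_chain = max(sizes)
```

If longest_chain stays small for a realistic name list, plain h % n with chaining is probably good enough in practice.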
It's not enough to know n. Allocation of an item to a bucket is a function of the item itself so, if you want a perfect hash function (one item per bucket), you need to know the data.
In any case, if you're limiting the number of elements to a known n, you're already technically O(1) lookup. The upper bound will be based on the constant n. This would be true even for a non-hash solution.
Your best bet is to probably just use the hash function you have and have each bucket be a linked list of the colliding items. Even if the hash is less than perfect, you're still greatly minimising the time taken.
Only if the hash is totally imperfect (all n elements placed in one bucket) will it be as bad as a normal linked list.
If you don't know the data in advance, a perfect hash is not possible. Unless, of course, you use h itself as the hash key rather than h%n but that's going to take an awful lot of storage :-)
My advice is to go the good-enough hash with linked list route. I don't doubt that you could make a better hash function based on the relative frequencies of letters in people's names across the population but even the hash you have (which is ideal for all letters having the same frequency) should be adequate.
And, anyway, if you start relying on frequencies and you get an influx of people from those countries that don't seem to use vowels (a la Bosnia [a]), you'll end up with more collisions.
But keep in mind that it really depends on the n that you're using.
If n is small enough, you could even get away with a sequential search of an unsorted array. I'm assuming your n is large enough here that you've already established that (or a balanced binary tree) won't give you enough performance.
A case in point: we have some code which searches through problem dockets looking for names of people that left comments (so we can establish the last member on our team who responded). There's only ever about ten or so members in our team so we just use a sequential search for them - the performance improvement from using a faster data structure was deemed too much trouble.
[a] No offence intended. I just remember the humorous article a long time ago about Clinton authorising the airlifting of vowels to Bosnia. I'm sure there are other countries with a similar "problem".
What you're after is called a Perfect Hash. It's a hash function where all the keys are known ahead of time, designed so that there are no collisions.
The gperf program generates C code for perfect hashes.
It sounds like you're looking for an implementation of a perfect hash function, or perhaps even a minimal perfect hash function. According to the Wikipedia page, CMPH might fit your needs. Disclaimer: I've never used it.
The optimal algorithm for mapping n strings to integers 1-n is to build a DFA where the terminating states are the integers 1-n. (I'm sure someone here will step up with a fancy name for this...but in the end it's all DFA.) Size/speed tradeoff can be adjusted by varying your alphabet size (operating on bytes, half-bytes, or even bits).
