Fast unanchored text search for an array

The input: An array of strings, and a single string.
The task: Find all entries in the array where any substring of the entry matches the input string.
The input array can be prepared or sorted in any way required, and any auxiliary data structure required built. The time required to prepare the data structures is (within bounds of sanity) unimportant.
The goal is maximum speed on the search.
What algorithm would you use that isn't just a linear search?

Because it says time required to prepare data structures is unimportant, I'd hash it. The key is a string (specifically, a substring), and the value is a list of integers corresponding to indices in the array whose elements have the key as a substring.
To build, take each string in the array and determine all possible substrings of that string, inserting each such key-value pair into the hash table. If the key already exists, append the index to the list rather than inserting/creating a new list.
Once the hash table is built, a lookup is a single O(1) operation: grab the list keyed by the input string and return it.
EDIT: Looking more closely at the question, it seems like you'd want to return the actual strings in the array, rather than their indices. The hash table approach will work either way.
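A minimal Python sketch of this substring-index approach (the sample words are my own, and I use a set instead of a list so repeated substrings within one string don't duplicate its index):

```python
from collections import defaultdict

def build_substring_index(strings):
    """Map every substring of every string to the indices of strings containing it."""
    index = defaultdict(set)
    for i, s in enumerate(strings):
        for a in range(len(s)):
            for b in range(a + 1, len(s) + 1):
                index[s[a:b]].add(i)
    return index

words = ["cat", "concat", "dog"]
idx = build_substring_index(words)
print(sorted(idx["cat"]))  # [0, 1] -- "cat" and "concat" both contain "cat"
```

Note the memory cost: a string of length N contributes up to N²/2 distinct keys, which is exactly what the suffix-based answer below the fold tries to avoid.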

You might want to build an index of all string suffixes. Look into suffix trees to find out how this could be done. The Wikipedia article might be too general, so here is an adapted algorithm:
Building the index
for each string in the array, get all its suffixes (there are N suffixes for a string of length N) and store a reference to the string in an ordered associative container, e.g. an OrderedMap<suffix, list of string references> (the index)
Searching
find the lower bound of your search term in the index
scan forward through the index from that lower bound until the keys stop being prefixed by the search term
the union of all references you find along the way is your search result
There are about N²/2 substrings of a string of length N, but only N suffixes, so a suffix-based data structure should be more memory-efficient than a substring-based one.
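A sketch of the suffix-index idea in Python, using a sorted list with `bisect` in place of an ordered map (the sample words are illustrative):

```python
import bisect

def build_suffix_index(strings):
    """Store every suffix of every string, paired with the string's index, sorted."""
    entries = []
    for i, s in enumerate(strings):
        for start in range(len(s)):
            entries.append((s[start:], i))
    entries.sort()
    return entries

def search(index, term):
    """Return indices of all strings containing `term` as a substring."""
    lo = bisect.bisect_left(index, (term,))  # lower bound of the search term
    result = set()
    # Scan forward while the suffix keys are still prefixed by the term.
    while lo < len(index) and index[lo][0].startswith(term):
        result.add(index[lo][1])
        lo += 1
    return result

words = ["banana", "apple", "bandana"]
idx = build_suffix_index(words)
print(sorted(search(idx, "an")))  # [0, 2] -- "banana" and "bandana"
```

Any substring is a prefix of some suffix, which is why prefix-scanning the suffix list finds unanchored matches.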

Related

What would be the best (practice) way to store data about occurrences and positions of words in a text so that it's quickly accessible?

I'm about to start writing a program which will analyze a text and store all the unique words in the text in some form which can later be called upon. When called upon it will give the position of all occurrences of this word in the original text and return the surrounding words as well.
I think the best way to do this would be to use a hashmap because it works with the unique words as a key and then an int[] as the mapped values. But I don't know if this is considered best practice or not. My solution would have one array to store the original text, which might be quite big, and one hashmap with one key-value pair for each unique word which might be almost as large as the array containing the text. How would you solve it?
An alternative possibility is a 26-ary tree (considering your alphabet has 26 characters).
Build your tree storing the words you encounter; each node will represent a word. Then in each node you can store an array of pointers to the occurrences of that word in the text (or an array of ints representing indexes).
In terms of memory and complexity, it is equivalent to the hash map implementation (same speed, slightly more compact), but it seems a bit more intuitive to me than the hash map.
So I'd say it's mainly up to you and your favorite structures.
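A minimal Python sketch of that tree idea, using a dict of children per node rather than a fixed 26-slot array (names and sample text are my own):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.positions = []  # positions of this word in the original text

def insert(root, word, position):
    """Walk/extend the trie for `word` and record where it occurred."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.positions.append(position)

def lookup(root, word):
    """Return all recorded positions of `word`, or [] if absent."""
    node = root
    for ch in word:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return node.positions

root = TrieNode()
for pos, word in enumerate("the cat sat on the mat".split()):
    insert(root, word, pos)

print(lookup(root, "the"))  # [0, 4]
```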
Hash maps are made for this kind of task.
You should probably map strings to a structure (rather than an int array).
That structure might record position and previous and next word - it's not precisely clear what you mean by 'surrounding'.
You may have to decide if your process is case sensitive. Are "You" and "you" the same word? Depending on the language you may be able to provide a case-insensitive comparator and hashing function or need to 'low case' all the entries.
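A small Python sketch combining these suggestions: a map from lowercased word to positions, plus a helper that returns the surrounding words (the window size and sample text are assumptions):

```python
from collections import defaultdict

text = "to be or not to be that is the question".split()

# Map each lowercased unique word to every position where it occurs.
index = defaultdict(list)
for pos, word in enumerate(text):
    index[word.lower()].append(pos)

def occurrences(word, window=1):
    """Return each occurrence of `word` together with its surrounding words."""
    return [text[max(0, p - window): p + window + 1] for p in index[word.lower()]]

print(index["be"])        # [1, 5]
print(occurrences("be"))  # [['to', 'be', 'or'], ['to', 'be', 'that']]
```

Lowercasing at insertion time is one way to get the case-insensitive behavior discussed above; a custom comparator is the alternative when case must be preserved.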

How do I index variable length strings, integers, binaries in b-tree?

I am creating a database storage engine (for fun).
I know it uses B-trees (and such), but all the basic B-tree examples show sorting the keys and then storing them for indexing, and only for integers.
I can understand sorting integers, but how do I do it for strings, if I have a string as the key for indexing?
Ex: I want to index all email addresses in a B-tree. How would I do that?
It does not matter what type of data you are sorting. For a B-tree you only need a comparator. The first value you put into your db is the root. The second value gets compared to the root: if it is smaller, continue down the left; otherwise, the right. Inserting new values often requires restructuring your tree.
A comparator for strings could use the length of the string, compare them alphabetically, or count the dots behind the at-sign of an email address.
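A few toy comparators along those lines, sketched in Python (the function names and the dot-counting rule for emails are illustrative, not a recommendation):

```python
def alphabetical(a, b):
    """Classic lexicographic comparison: negative, zero, or positive."""
    return (a > b) - (a < b)

def by_length_then_alpha(a, b):
    """Compare by length first, falling back to alphabetical order."""
    return (len(a) - len(b)) or alphabetical(a, b)

def by_dots_after_at(a, b):
    """Toy email comparator: order by the number of dots in the domain part."""
    dots = lambda s: s.split("@", 1)[-1].count(".")
    return dots(a) - dots(b)

print(alphabetical("alice@example.com", "bob@example.com"))  # negative
```

Whatever comparator you pick, it must be consistent (a total order), or the B-tree's invariants break.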

Finding k different keys using binary search in an array of n elements

Say, I have a sorted array of n elements. I want to find 2 different keys k1 and k2 in this array using Binary search.
A basic solution would be to apply binary search to each key separately, i.e. two calls for the 2 keys, which keeps the time complexity at 2·O(log n).
Can we solve this problem using any other approach(es) for different k keys, k < n ?
Each search you complete can be used to subdivide the input to make it more efficient. For example suppose the element corresponding to k1 is at index i1. If k2 > k1 you can restrict the second search to i1..n, otherwise restrict it to 0..i1.
Best case is when your search keys are sorted also, so every new search can begin where the last one was found.
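A quick Python sketch of reusing the first result to restrict the second search (assuming k1 <= k2, and using `bisect` as the binary search):

```python
import bisect

def find_two(arr, k1, k2):
    """Find insertion points for two keys (k1 <= k2) in a sorted array.

    The second search is restricted to arr[i1:], reusing the first result.
    """
    i1 = bisect.bisect_left(arr, k1)
    i2 = bisect.bisect_left(arr, k2, lo=i1)  # search only the right part
    return i1, i2

arr = [1, 3, 5, 7, 9, 11]
print(find_two(arr, 3, 9))  # (1, 4)
```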
You can reduce the real complexity (although it will still be the same big O) by walking the shared search path once. That is, start the binary search until the element you're at is between the two items you are looking for. At that point, spawn a thread to continue the binary search for one element in the range past the pivot element you're at and spawn a thread to continue the binary search for the other element in the range before the pivot element you're at. Return both results. :-)
EDIT:
As Oli Charlesworth mentioned in his comment, you did ask for an arbitrary number of keys. The same logic can be extended to an arbitrary number of search keys, though. Here is an example:
You have an array of search keys like so:
searchKeys = ['findme1', 'findme2', ...]
You have key-value datastructure that maps a search key to the value found:
keyToValue = {'findme1': 'foundme1', 'findme2': 'foundme2', 'findme3': 'NOT_FOUND_VALUE'}
Now, following the same logic as before this EDIT, you can pass a "pruned" searchKeys array on each thread spawn where the keys diverge at the pivot. Each time you find a value for the given key, you update the keyToValue map. When there are no more ranges to search but still values in the searchKeys array, you can assume those keys are not to be found and you can update the mapping to signify that in some way (some null-like value perhaps?). When all threads have been joined (or by use of a counter), you return the mapping. The big win here is that you did not have to repeat the initial search logic that any two keys may share.
Second EDIT:
As Mark has added in his answer, sorting the search keys allows you to only have to look at the first item in the key range.
You can find academic articles calculating the complexity of different schemes for the general case, which is merging two sorted sequences of possibly very different lengths using the minimum number of comparisons. The paper at http://www.math.cmu.edu/~af1p/Texfiles/HL.pdf analyses one of the best known schemes, by Hwang and Lin, and has references to other schemes, and to the original paper by Hwang and Lin.
It looks a lot like a merge which steps through each item of the smaller list, skipping along the larger list with a stepsize that is the ratio of the sizes of the two lists. If it finds out that it has stepped too far along the large list it can use binary search to find a match amongst the values it has stepped over. If it has not stepped far enough, it takes another step.
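A rough Python sketch of that stepping-and-backtracking idea (the fixed step size and gallop scheme here are a simplification for illustration, not Hwang and Lin's exact algorithm):

```python
import bisect

def find_all(haystack, sorted_keys):
    """Locate each sorted key in a sorted haystack, stepping forward in chunks.

    Steps along the haystack by `step`; once it has stepped past a key,
    a bounded binary search finds the key among the skipped values.
    """
    results = {}
    lo = 0
    step = max(1, len(haystack) // max(1, len(sorted_keys)))
    for k in sorted_keys:
        hi = lo
        while hi < len(haystack) and haystack[hi] < k:
            hi += step  # stride forward until we have stepped far enough
        i = bisect.bisect_left(haystack, k, lo, min(hi + 1, len(haystack)))
        results[k] = i if i < len(haystack) and haystack[i] == k else None
        lo = i  # later (larger) keys can only be further right
    return results

print(find_all([2, 4, 6, 8, 10, 12], [4, 10, 11]))
# {4: 1, 10: 4, 11: None}
```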

how to implement the string key in B+ Tree?

Many B+ tree examples are implemented using integer keys, but I have seen other examples using both integer and string keys. I have learned the B+ tree basics, but I don't understand how a string key works.
I also use a multi-level B-tree. A string, let's say "test", can be seen as an array [t,e,s,t]. Now think of a tree of trees. Each node holds only one character for a certain position. You also need to think about a key/value array implementation, like a growing linked list of arrays, trees, or whatever. You can also make the node size dynamic (a limited number of letters).
If all the keys fit in the leaf, you store them in the leaf. If the leaf gets too big, you can add new nodes.
And since each node knows its letter and position, you can strip those characters from the keys in the leaf and reconstruct them as you search, or whenever you know the leaf plus the position within the leaf.
If, after you have created the tree, you write it out in a certain format, you end up with string compression, where each letter combination (prefix) is stored only once, even if it is shared by thousands of strings.
Simple compression like this often yields 1:10 compression for normal text (in any language!) and 1:4 in memory. And you can still search for any given word (i.e. the strings in the dictionary you built the B+ tree for).
This is one extreme of how you can use multiple levels.
Databases usually use a certain prefix tree (keying on the first x characters, storing the rest in the leaves, and using binary search within the leaf). There are also implementations that use variable prefix lengths based on the actual density. So in the end it is very implementation-specific, and a lot of options exist.
If the tree should only aid in finding the exact string, adding the length and hashing the lower bits of each character often does the trick. For example, you could generate a hash out of length (8 bits) + 4 bits × 6 characters = 32 bits: that's your hash code. Or you could use the first, last, and middle characters along with it. Since the length is one of the most selective properties, you won't find many collisions while searching for your string.
This solution is very good for finding a particular string, but it destroys the natural order of the strings, so it gives you no way to answer range queries and the like. But for cases where you search for a particular username, email, or address, such a tree would be superior (though the question then is why not use a hash map).
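A toy Python version of such a length-plus-characters key (the exact bit layout is just one way to fill the 8 + 4×6 = 32 bits described above, and the function name is my own):

```python
def compact_key(s):
    """Hypothetical 32-bit key: 8 bits of length + 4 low bits of 6 characters.

    Destroys lexicographic order, so it only supports exact-match lookups.
    """
    key = (len(s) & 0xFF) << 24            # length in the top byte
    for i, ch in enumerate(s[:6]):
        key |= (ord(ch) & 0xF) << (20 - 4 * i)  # 4 low bits per character
    return key

print(hex(compact_key("alice@example.com")))
```

Two strings of different lengths can never collide, which is why the length byte is so selective in practice.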
The string key can be a pointer to a string (very likely).
Or the key could be sized to fit most strings. 64 bits holds 8 byte strings and even 16 byte keys aren't too ridiculous.
Choosing a key really depends on how you plan to use it.

The optimum* way to do a table-lookup-like function in C?

I have to do a table lookup to translate from input A to output A'. I have a function with input A which should return A'. Using databases or flat files are not possible for certain reasons. I have to hardcode the lookup in the program itself.
What would be the most optimal approach (*space-wise and time-wise, separately): a hashmap with A as the key and A' as the value, or switch-case statements in the function?
The table is a string to string lookup with a size of about 60 entries.
If speed is ultra ultra necessary, then I would consider perfect hashing. Otherwise I'd use an array/vector of string to string pairs, created statically in sort order and use binary search. I'd also write a small test program to check the speed and memory constraints were met.
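A sketch of the statically sorted table with binary search, in Python rather than C for brevity (the table entries are placeholders for the ~60 real pairs):

```python
import bisect

# Hypothetical translation table, stored statically in sorted order.
TABLE = sorted([
    ("apple", "pomme"),
    ("cat", "chat"),
    ("dog", "chien"),
    # ... remaining entries ...
])
KEYS = [k for k, _ in TABLE]

def translate(a):
    """Binary-search the sorted key list; O(log n) per lookup."""
    i = bisect.bisect_left(KEYS, a)
    if i < len(KEYS) and KEYS[i] == a:
        return TABLE[i][1]
    return None

print(translate("cat"))  # "chat"
```

At ~60 entries the log factor hardly matters; the win over a hash is compactness and zero hashing cost, which is the trade-off the answer above describes.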
I believe that both the switch and the table-look up will be equivalent (although one should do some tests on the compiler being used). A modern C compiler will implement a big switch with a look-up table. The table look-up can be created more easily with a macro or a scripting language.
For both solutions the input A must be an integer. If this is not the case, one solution will be to implement a huge if-else statement.
If you have strings you can create two arrays - one for input and one for output (this will be inefficient if they aren't of the same size). Then you need to iterate the contents of the input array to find a match. Based on the index you find, you return the corresponding output string.
Make a key that is fast to calculate, and hash
If the table is pretty static and unlikely to change in the future, you could check whether adding a few selected chars (at fixed indexes) of the "key" string yields unique values (call the result K). If so, insert the "value" strings into a hash table using the pre-calculated K for each "key" string.
Although a hash method is fast, there is still the possibility of collision (two inputs generating the same hash value). A fast method depends on the data type of the input.
For integral types, the fastest table lookup method is an array. Use the incoming datum as an index into the array. One of the problems with this method is that the array must account for the entire spectrum of values for the fastest speed. Otherwise execution is slowed down by translating the original index into an index for the array (kind of like a hashing method).
For string input types, a nested look up may be the fastest. One example is to break up tables by length. The first array returns pointers to the table to search based on length, e.g. char * sub_table = First_Array[5] for a string of length 5. These can be configured for specialized input data.
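A small Python sketch of that length-bucketed nested lookup (the table contents are invented):

```python
# Hypothetical nested lookup: first bucket the table by key length,
# then search only within the matching bucket.
PAIRS = [("on", "ON"), ("off", "OFF"), ("auto", "AUTO"), ("fast", "FAST")]

by_length = {}
for key, value in PAIRS:
    by_length.setdefault(len(key), {})[key] = value

def lookup(a):
    """Jump straight to the sub-table for strings of this length."""
    return by_length.get(len(a), {}).get(a)

print(lookup("auto"))  # "AUTO"
```

In C the outer level would be an array indexed by length, as the answer suggests, with each slot pointing at a small sub-table.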
Another method is to use a B-tree, which is a multiway search tree of "pages". Its behavior is similar to nested arrays.
If you let us know the input type, we can better answer your question.
