I am creating a database storage engine (for fun).
I know it uses B-trees (and stuff), but all the B-tree examples I have seen show that we need to sort the keys and then store them for indexing, and they only do it for integers.
I understand sorting, but how do I do it for strings, if I have a string as the key for indexing?
Example: I want to index all email addresses in a B-tree. How would I do that?
It does not matter what type of data you are sorting. For a B-Tree you only need a comparator. The first value you put into your db is the root. The second value gets compared to the root. If it is smaller, continue down to the left, else to the right. Inserting new values often requires restructuring your tree.
A comparator for a string could use the length of the string, compare it alphabetically, or count the dots after the at-sign in an email address.
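As a rough illustration of what such a comparator could look like, here is a minimal Python sketch. The function names and the sample addresses are made up for this example; a real engine would call the comparator from its own insert and search routines rather than from `sorted`.

```python
from functools import cmp_to_key

def compare_alphabetically(a: str, b: str) -> int:
    """Return -1, 0 or 1, in the spirit of C's strcmp."""
    return (a > b) - (a < b)

def compare_by_length(a: str, b: str) -> int:
    """Shorter strings sort first; ties fall back to alphabetical order."""
    return (len(a) - len(b)) or compare_alphabetically(a, b)

# Hypothetical keys, only for demonstration.
emails = ["zed@a.io", "al@example.com", "carol@example.net"]

by_alphabet = sorted(emails, key=cmp_to_key(compare_alphabetically))
by_length = sorted(emails, key=cmp_to_key(compare_by_length))
```

The tree does not care which comparator you pick, as long as the same one is used consistently for every insert and every lookup.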
I'm about to start writing a program which will analyze a text and store all the unique words in the text in some form which can later be called upon. When called upon it will give the position of all occurrences of this word in the original text and return the surrounding words as well.
I think the best way to do this would be to use a hashmap because it works with the unique words as a key and then an int[] as the mapped values. But I don't know if this is considered best practice or not. My solution would have one array to store the original text, which might be quite big, and one hashmap with one key-value pair for each unique word which might be almost as large as the array containing the text. How would you solve it?
An alternative possibility is a 26-ary tree (considering your alphabet has 26 characters).
Build your tree by storing the words you encounter; each node will represent a word. In each node you can then store an array of pointers to the occurrences of the word in the text (or an array of ints representing indexes).
In terms of memory and complexity, it is equivalent to the hash map implementation (same speed, slightly more compact), but it seems a bit more intuitive to me than the hash map.
So I'd say it's mainly up to you and your favorite structures.
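To make the 26-ary tree idea a bit more concrete, here is a small sketch. It assumes the words contain only lowercase letters a to z, and the names `TrieNode`, `insert` and `lookup` are invented for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = [None] * 26   # one slot per letter 'a'..'z'
        self.positions = []           # indexes of the words that end at this node

def insert(root, word, position):
    """Walk down letter by letter, creating nodes as needed."""
    node = root
    for ch in word:
        i = ord(ch) - ord("a")
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.positions.append(position)

def lookup(root, word):
    """Return the positions stored for word, or an empty list."""
    node = root
    for ch in word:
        node = node.children[ord(ch) - ord("a")]
        if node is None:
            return []
    return node.positions

words = "the cat saw the other cat".split()
root = TrieNode()
for pos, w in enumerate(words):
    insert(root, w, pos)
```

With the positions in hand, the surrounding words can be read straight from the `words` array.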
Hash maps are made for this kind of task.
You should probably map strings to a structure (rather than an int array).
That structure might record position and previous and next word - it's not precisely clear what you mean by 'surrounding'.
You may have to decide whether your process is case-sensitive. Are "You" and "you" the same word? Depending on the language, you may be able to provide a case-insensitive comparator and hashing function, or you may need to lowercase all the entries.
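Here is one way the hash map variant might look, as a sketch only. It assumes "surrounding" means the previous and next word, splits on whitespace, and treats case as insignificant; the `Occurrence` structure and `build_index` function are names made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Occurrence:
    position: int               # index of the word in the text
    previous: Optional[str]     # word before it, None at the start
    following: Optional[str]    # word after it, None at the end

def build_index(text: str) -> dict:
    words = text.split()
    index = defaultdict(list)
    for i, word in enumerate(words):
        prev = words[i - 1] if i > 0 else None
        nxt = words[i + 1] if i + 1 < len(words) else None
        # casefold() makes "You" and "you" share one entry
        index[word.casefold()].append(Occurrence(i, prev, nxt))
    return index

index = build_index("You said you would call")
```

Here `index["you"]` holds two occurrences, at positions 0 and 2, each with its neighbours.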
The input: An array of strings, and a single string.
The task: Find all entries in the array where any substring of the entry matches the input string.
The input array can be prepared or sorted in any way required, and any auxiliary data structure required built. The time required to prepare the data structures is (within bounds of sanity) unimportant.
The goal is maximum speed on the search.
What algorithm would you use that isn't just a linear search?
Because it says time required to prepare data structures is unimportant, I'd hash it. The key is a string (specifically, a substring), and the value is a list of integers corresponding to indices in the array whose elements have the key as a substring.
To build, take each string in the array and determine all possible substrings of that string, inserting each such key-value pair into the hash table. If the key already exists, append the index to the list rather than inserting/creating a new list.
Once you have built this hash table, the search is a single O(1) lookup: grab the list stored under the input string and return it.
EDIT: Looking more closely at the question, it seems like you'd want to return the actual strings in the array, rather than their indices. The hash table approach will work either way.
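A minimal sketch of the substring hash table follows. It uses a set instead of a list per key, which is a small deviation from the description, so that an index is not stored twice when a substring occurs more than once in the same string. The sample strings are made up.

```python
from collections import defaultdict

def build_substring_index(strings):
    """Map every substring to the set of array indexes that contain it."""
    index = defaultdict(set)
    for i, s in enumerate(strings):
        for start in range(len(s)):
            for end in range(start + 1, len(s) + 1):
                index[s[start:end]].add(i)
    return index

def search(index, strings, query):
    """One hash lookup, then map the indexes back to the strings."""
    return [strings[i] for i in sorted(index.get(query, ()))]

strings = ["apple pie", "pineapple", "grape"]
index = build_substring_index(strings)
```

The price is memory: a string of length N contributes on the order of N² keys.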
You might want to build an index of all string suffixes. Look into suffix trees to find out how this could be done. The Wikipedia article might be too general, so here is an adapted algorithm:
Building index
for each string in the array
get all its suffixes (there are N suffixes for a string of length N) and, for each suffix, store a reference to the string in an ordered associative container keyed by the suffix (an ordered map; call it the index)
Searching
find the lower bound of your search term in the index
move through the index starting from the lower bound for as long as the index key is prefixed with the search term
the union of all references you find is your search result
There are about N²/2 substrings for a string of length N but only N suffixes, so a suffix-based data structure should be more memory-efficient than a substring-based one.
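The suffix index could be sketched like this. A sorted Python list plus `bisect` stands in for the ordered associative container, which is an assumption of this example rather than a requirement; the function names and sample data are made up.

```python
import bisect

def build_suffix_index(strings):
    """Sorted list of (suffix, index of the string it came from)."""
    entries = []
    for i, s in enumerate(strings):
        for start in range(len(s)):
            entries.append((s[start:], i))
    entries.sort()
    return entries

def search(entries, strings, query):
    # (query,) sorts before every (query..., i) pair, so this is the lower bound
    pos = bisect.bisect_left(entries, (query,))
    found = set()
    # keys that start with the query are contiguous in sorted order
    while pos < len(entries) and entries[pos][0].startswith(query):
        found.add(entries[pos][1])
        pos += 1
    return [strings[i] for i in sorted(found)]

strings = ["apple pie", "pineapple", "grape"]
entries = build_suffix_index(strings)
```

This works because every substring of a string is a prefix of one of its suffixes.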
Many B+ tree examples are implemented using integer keys, but I have seen some other examples using both integer keys and string keys. I have learned the B+ tree basics, but I don't understand how string keys work.
I also use a multi-level B-Tree. A string, let's say test, can be seen as an array [t,e,s,t]. Now think about a tree of trees. Each node holds only one character for a certain position. You also need to think about a key/value array implementation, such as a growing linked list of arrays, trees or whatever. It can also make the node size dynamic (limited number of letters).
If all keys fit in the leaf, you store them in the leaf. If the leaf gets too big, you can add new nodes.
And since each node knows its letter and position, you can strip those characters from the keys in the leaf and reconstruct them as you search, or when you know the leaf and the position in the leaf.
If, after you have created the tree, you write the tree out in a suitable format, you end up with string compression where each letter combination (prefix) is stored only once, even if it is shared by thousands of strings.
Simple compression often results in 1:10 compression for normal text (in any language!) and 1:4 in memory. You can also search for any given word (the words being the strings in the dictionary you used the B+Tree for).
This is one extreme where you can use multiple levels.
Databases usually use a kind of prefix tree (indexing the first x characters, storing the rest in the leaves, and using binary search within the leaf). There are also implementations that use variable prefix lengths based on the actual density. So in the end it is very implementation-specific and a lot of options exist.
If the tree should only aid in finding an exact string, adding the length and hashing the lower bits of each character often does the trick. For example, you could generate a hash out of the length (8 bits) + 4 bits * 6 characters = 32 bits, and that is your hash code. Or you can use the first, last and middle characters along with it. Since the length is one of the most selective attributes, you won't find many collisions while searching for your string.
This solution is very good for finding a particular string but destroys the natural order of the strings, giving you no chance of answering range queries and the like. But when you search for a particular username, email or address, such a tree would be superior (though the question is then why not use a hashmap).
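The length-plus-low-bits hash could be sketched as below. Capping the length at 255, taking the first six characters, and padding short strings with zeros are all assumptions of this example; the function name `packed_hash` is made up.

```python
def packed_hash(s: str) -> int:
    """8 bits of length followed by the low 4 bits of the first 6 characters."""
    h = min(len(s), 255)                  # length, capped so it fits in 8 bits
    for i in range(6):
        low_bits = ord(s[i]) & 0xF if i < len(s) else 0   # pad short strings with 0
        h = (h << 4) | low_bits
    return h                              # 8 + 6 * 4 = 32 bits

code = packed_hash("test")                # 0x04453400
```

Note that "test" and "dest" get the same code, because 't' and 'd' share their low 4 bits, so the stored string still has to be compared on a match. Sorting by these codes also has nothing to do with alphabetical order, which is why range queries no longer work.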
The string key can be a pointer to a string (very likely).
Or the key could be sized to fit most strings. 64 bits holds 8 byte strings and even 16 byte keys aren't too ridiculous.
Choosing a key really depends on how you plan to use it.
It's just a question out of curiosity. Suppose we have an associative array A. How is A["hello"] actually evaluated, as in, how does the system map the index "hello" to a memory location?
Typically it uses a data structure that facilitates quick lookup in mostly constant time.
One such typical approach is to use a hashtable, where the key ("hello" in your case) would be hashed, and by that I mean that a number is calculated from it. This number is then used as an index into an array, and in the element with that index, the value exists.
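A toy version of that idea might look like the following. It is only a sketch: the class name is made up, the table has a fixed number of slots, and keys that land on the same slot are simply kept in a small list, which is one of several ways real implementations deal with collisions.

```python
class TinyHashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]   # the underlying array

    def _index(self, key):
        # the key is turned into a number, then into an array index
        return hash(key) % len(self.buckets)

    def __setitem__(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def __getitem__(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

A = TinyHashTable()
A["hello"] = "world"
```

Reading `A["hello"]` hashes the key again, goes to the same slot, and returns the value stored there.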
Other data structures exist as well, such as binary trees, tries, etc.
You can google for keywords: hashtable, binary tree, trie.
I have to do a table lookup to translate from input A to output A'. I have a function with input A which should return A'. Using databases or flat files are not possible for certain reasons. I have to hardcode the lookup in the program itself.
What would be optimal (space-wise and time-wise, considered separately): using a hashmap with A as the key and A' as the value, or using switch-case statements in the function?
The table is a string to string lookup with a size of about 60 entries.
If speed is ultra ultra necessary, then I would consider perfect hashing. Otherwise I'd use an array/vector of string to string pairs, created statically in sort order and use binary search. I'd also write a small test program to check the speed and memory constraints were met.
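The sorted-array-plus-binary-search option could look roughly like this. The table contents are invented placeholders; the real table would hold the roughly 60 hardcoded pairs.

```python
import bisect

# Made-up sample entries, kept in sorted key order.
TABLE = sorted([
    ("red", "rouge"),
    ("green", "vert"),
    ("blue", "bleu"),
    ("black", "noir"),
])
KEYS = [key for key, _ in TABLE]

def translate(a):
    """Binary search for a; return its translation or None."""
    i = bisect.bisect_left(KEYS, a)
    if i < len(KEYS) and KEYS[i] == a:
        return TABLE[i][1]
    return None
```

With 60 entries a lookup needs at most about six string comparisons.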
I believe that both the switch and the table-look up will be equivalent (although one should do some tests on the compiler being used). A modern C compiler will implement a big switch with a look-up table. The table look-up can be created more easily with a macro or a scripting language.
For both solutions the input A must be an integer. If this is not the case, one solution will be to implement a huge if-else statement.
If you have strings you can create two arrays - one for input and one for output (this will be inefficient if they aren't of the same size). Then you need to iterate the contents of the input array to find a match. Based on the index you find, you return the corresponding output string.
Make a key that is fast to calculate, and hash.
If the table is pretty static and unlikely to change in the future, you could check whether picking a few selected chars (at fixed indexes) from the "key" string gives a unique value (call it K) for every key. Then insert the "value" strings into a hash table using the pre-calculated K of each "key" string.
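One way to hunt for such fixed character positions is a brute-force search over small sets of indexes, as in this sketch. The sample table, the limit of three positions, and all names are assumptions made for the example; a Python dict stands in for the final hash table.

```python
from itertools import combinations

# Made-up sample table, only for demonstration.
PAIRS = [("red", "rouge"), ("green", "vert"), ("blue", "bleu"), ("black", "noir")]

def find_unique_positions(keys, max_positions=3):
    """Search for a few fixed character positions that tell all keys apart."""
    shortest = min(len(k) for k in keys)
    for n in range(1, max_positions + 1):
        for positions in combinations(range(shortest), n):
            picked = {tuple(k[p] for p in positions) for k in keys}
            if len(picked) == len(keys):
                return positions
    return None

POSITIONS = find_unique_positions([k for k, _ in PAIRS])
if POSITIONS is None:
    raise ValueError("no small set of positions separates these keys")

# K (the selected characters) -> (full key, value)
BY_K = {tuple(k[p] for p in POSITIONS): (k, v) for k, v in PAIRS}

def translate(a):
    if len(a) <= max(POSITIONS):
        return None
    entry = BY_K.get(tuple(a[p] for p in POSITIONS))
    # K is only unique among table keys, so confirm the full key
    if entry is not None and entry[0] == a:
        return entry[1]
    return None
```

For this sample data the third character alone is enough to separate the four keys.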
Although a hash method is fast, there is still the possibility of collision (two inputs generating the same hash value). A fast method depends on the data type of the input.
For integral types, the fastest table lookup method is an array. Use the incoming datum as an index into the array. One of the problems with this method is that the array must account for the entire spectrum of values for the fastest speed. Otherwise execution is slowed down by translating the original index into an index for the array (kind of like a hashing method).
For string input types, a nested look up may be the fastest. One example is to break up tables by length. The first array returns pointers to the table to search based on length, e.g. char * sub_table = First_Array[5] for a string of length 5. These can be configured for specialized input data.
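The split-by-length idea could be sketched as follows. The name `first_array` mirrors First_Array from the text; everything else, including the sample data and the linear scan inside each sub-table, is an assumption of this example.

```python
# Made-up sample entries, only for demonstration.
PAIRS = [("red", "rouge"), ("green", "vert"), ("blue", "bleu"), ("black", "noir")]

def build_by_length(pairs):
    """first_array[n] is the sub-table of all entries whose key has length n."""
    max_len = max(len(k) for k, _ in pairs)
    first_array = [[] for _ in range(max_len + 1)]
    for k, v in pairs:
        first_array[len(k)].append((k, v))
    return first_array

def translate(first_array, a):
    if len(a) >= len(first_array):        # longer than any key in the table
        return None
    for k, v in first_array[len(a)]:      # only same-length keys are compared
        if k == a:
            return v
    return None

first_array = build_by_length(PAIRS)
```

How much this helps depends on how evenly the key lengths are spread.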
Another method is to use a B-Tree, which is a balanced multiway search tree whose nodes are "pages". Behavior is similar to nested arrays.
If you let us know the input type, we can better answer your question.