how to implement the string key in B+ Tree? - c

Many b+ tree examples are implemented using integer key, but i had seen some other examples using both integer key and string key, i learned the b+ tree basis, but i don't understand how string key works?

I also use a multi level B-Tree. Having a string lets say test can be seen as an array of [t,e,s,t]. Now think about a tree of trees. Each node can only hold one character for a certain position. You also need to think about a certain key /value array implementation like a growing linked list of arrays, trees or whatever. It also can make the node size dynamic (limited amount of letters).
If all keys fit the leaf, you store it in the leaf. If the leaf gets to big, you can add new nodes.
And now since each node knows its letter and position, you can strip those characters from the keys in the leaf and reconstruct them as you search or if you know the leaf + the position in the leaf.
If you now, after you have created the tree, write the tree in a certain format, you end up having string compression where you store each letter combination (prefix) only once even if it is shared by 1000ends of strings.
Simple compression often results in a 1:10 compression for normal text (in any language!) and in memory in 1:4. And also you can search for any given word (which are the strings in your dictionary you used the B+Tree for.
This is one extrem where you can use multilevel.
Databases usually use a certain prefix tree (the first x characters and store the rest in the leafs and use binary search within the leaf). Also there are implementations that use variable prefix lengths based on the actual density. So in the end it is very implementation specific and a lot of options exist.
If the tree should aid in finding the exact string. Often adding the length and using hash of lower bits of each characters do the trick. For example you could generate a hash out of length(8bit) + 4bit * 6 characters = 32Bit -> its your hash code. Or you can use the first, last and middle characters along with it. Since the length is one of the most selective you wont find many collisions while search your string.
This solution is very good for finding a particular string but destroyes the natural order of the strings so giving you no chance of answering range queries and alike. But for times where you search for a particular username / email or address those tree would be supperior (but question is why not use a hashmap).

The string key can be a pointer to a string (very likely).
Or the key could be sized to fit most strings. 64 bits holds 8 byte strings and even 16 byte keys aren't too ridiculous.
Choosing a key really depends on how you plan to use it.

Related

What would be the best (practice) way to store data about occurrences and positions of words in a text so that it's quickly accessible?

I'm about to start writing a program which will analyze a text and store all the unique words in the text in some form which can later be called upon. When called upon it will give the position of all occurrences of this word in the original text and return the surrounding words as well.
I think the best way to do this would be to use a hashmap because it works with the unique words as a key and then an int[] as the mapped values. But I don't know if this is considered best practice or not. My solution would have one array to store the original text, which might be quite big, and one hashmap with one key-value pair for each unique word which might be almost as large as the array containing the text. How would you solve it?
An alternative possibility is a 26-ary tree (considering your alphabet has 26 characters).
Build your tree storing words you encounter, each node will represent a word ; then in each node you can store an array of pointers pointing towards occurrences of the words in the strings (or an array of int representing indexes).
In terms of memory and complexity, it is equivalent to the hash map implementation (same speed, slightly more compact), but it seems a bit more intuitive to me than the hash map.
So I'd say it's mainly up to you and your favorite structures.
A hash-maps are made for this kind of task.
You should probably map strings to a structure (rather than an int array).
That structure might record position and previous and next word - it's not precisely clear what you mean by 'surrounding'.
You may have to decide if your process is case sensitive. Are "You" and "you" the same word? Depending on the language you may be able to provide a case-insensitive comparator and hashing function or need to 'low case' all the entries.

Single word indexing technique

Suppose i have a large array where each element is one word, and i want to build an index.
Take the word Water, i can write a function that returns
w
wa
wat
wate
water
at
ate
ater
ter
er
r
and those results would be keys in a hash table where the values are arrays of words that contain the key.
Given that i don't care about memory consumption, and the data is read only, i.e inserted only at app startup:
theoretically what would beat this technique in terms of lookup performance?
what the name of this technique?
I think you're looking for a Trie:
a trie, also called digital tree and sometimes radix tree or prefix
tree (as they can be searched by prefixes), is a kind of search
tree—an ordered tree data structure that is used to store a dynamic
set or associative array where the keys are usually strings.

How do I index variable length strings, integers, binaries in b-tree?

I am creating a database storage engine (for fun).
I know it uses b-trees (and stuff), but in all of b-tree base examples, it shows that we need to sort keys and then store it for indexing, not for integers.
I can understand sorting, but how to do it for strings, if I have string as a key for indexing?
Ex : I want to index all email addresses in btree , how would I do that ??
It does not matter, what type of data you are sorting. For a B-Tree you only need a comparator. The first value you put into your db is the root. The second value gets compared to the root. If smaller, then continue down left, else right. Inserting new values often requires to restructure your tree.
A comparator for a string could use the length of the string or compare it alphabetically or count the dots in an email behind the at-sign.

How to store vocabulary in an array more effectively?

I've got a vocabulary, a, abandon, ... , z.
For some reason, I will use array rather than Trie to store them.
Thus a simple method can be: wordA\0wordB\0wordC\0...word\0
But there are some more economic methods for memory I think.
Since like is a substring of likely, we can only store the first position and length of like instead of the string itself. Thus we generate a "large string" which contains every words in vocabulary and use position[i] and length[i] to get the i-th word.
For example, vocabulary contains three words ab, cd and bc.
I construct abcd as the "large string".
position[0] = 0, length[0] = 2
position[1] = 2, length[1] = 2
position[2] = 1, length[2] = 2
So how to generate the "large string" is the key to this problem, are there any cool suggestions?
I think the problem is similar to TSP problem(Traveling Salesman Problem), which is a NP problem.
The search keyword you're looking for is "dictionary". i.e. data structures that can be used to store a list of words, and test other strings for being present or absent in the dictionary.
Your idea is more compact than storing every word separately, but far less compact than a good data structure like a DAWG. As you note, it isn't obvious how to optimally choose how to overlap your strings. What you're doing is a bit like what a lossless compression scheme (like gzip) would do. If you don't need to check words against your compact dictionary, maybe just use gzip or LZMA to compress a sorted word list. Let their algorithms find the redundancy and represent it compactly.
I looked into dictionaries for a recent SO answer that caught my interest: Memory-constrained external sorting of strings, with duplicates combined&counted, on a critical server (billions of filenames)
For a dictionary that doesn't have to have new words added on the fly, a Directed Acyclic Word Graph is the way to go. You match a string against it by following graph nodes until you either hit a point where there's no edge matching the next character, or you get to the end of your input string and find that the node in the DAWG is marked as being a valid end-of-word. (Rather than simply a substring that is only a prefix of some words). There are algorithms for building these state-machines from a simple array-of-words dictionary in reasonable time.
Your method can only take advantage of redundancy when a whole word is a substring of another word, or end-of-one, start-of-another. A DAWG can take advantage of common substrings everywhere, and is also quite fast to match words against. Probably comparable speed to binary-searching your data structure, esp. if the giant string is too big to fit in the cache. (Once you start exceeding cache size, compactness of the data structure starts to outweigh code complexity for speed.)
Less complex but still efficient is a Trie (or Radix Trie), where common prefixes are merged, but common substrings later in words don't converge again.
If you don't need to modify your DAWG or Trie at all, you can store it efficiently in a single block of memory, rather than dynamically allocating each node. You didn't say why you didn't want to use a Trie, and also didn't acknowledge the existence of the other data structures that do this job better than a plain Trie.

how to write order preserving minimal perfect hash for integer keys?

I have searched stackoverflow and google and cant find exactly what im looking for which is this:
I have a set of 4 byte unsigned integers keys, up to a million or so, that I need to use as an index into a table. The easiest would be to simply use the keys as an array index but I dont want to have a 4gb array when Im only going to use a couple of million entries! The table entries and keys are sequential so I need a hash function that preserves order.
e.g.
keys = {56, 69, 3493, 49956, 345678, 345679,....etc}
I want to translate the keys into {0, 1, 2, 3, 4, 5,....etc}
The keys could potentially be any integer but there wont be more than 2 million in total. The number will vary as keys (and corresponding array entries) will be deleted but new keys will always be higher numbered than the previous highest numbered key.
In the above example, if key 69 was deleted, then the hash integer returned on hashing 3493 should be 1 (rather than 2) as it then becomes the 2nd lowest number.
I hope I'm explaining this right. Is the above possible with any fast efficient hashing solution? I need the translation to take in the low 100s of nS though deletion I expect to take longer. I looked at CMPH but couldn't find any usage examples that didn't involved getting the data from a file. It needs to run under linux and compiled with gcc using pure C.
Actually, I don't know if I understand what exactly you want to do.
It seems you are trying to obtain the index number in the "array" (or "list") of sequentialy ordered integers that you have stored somewhere.
If you have stored these integer values in an array, then the algorithm that returns the index integer in optimal time is Binary Search.
Binary Search Algorithm
Since your list is known to be in order, then binary search works in O(log(N)) time, which is very fast.
If you delete an element in the list of "keys", the Binary Search Algorithm works anyway, without extra effort or space (however, the operation of removing one element in the list enforces to you, naturally, to move all the elements being at the right of the deleted element).
You only have to provide three data to the Ninary Search Algorithm: the array, the size of the array, and the desired key, of course.
There is a full Python implementation here. See also the materials available here. If you only need to decode the dictionary, the simplest way to go is to modify the Python code to make it spit out a C file defining the necessary array, and reimplement only the lookup function.
It could be solved by using two dynamic allocated arrays: One for the "keys" and one for the data for the keys.
To get the data for a specific key, you first find in in the key-array, and its index in the key-array is the index into the data array.
When you remove a key-data pair, or want to insert a new item, you reallocate the arrays, and copy over the keys/data to the correct places.
I don't claim this to be the best or most effective solution, but it is one solution to your problem anyway.
You don't need an order preserving minimal perfect hash, because any old hash would do. You don't want to use a 4GB array, but with 2 MB of items, you wouldn't mind using 3 MB of lookup entries.
A standard implementation of a hash map will do the job. It will allow you to delete and add entries and assign any value to entries as you add them.
This leaves you with the question "What hash function might I use on integers?" The usual answer is to take the remainder when dividing by a prime. The prime is chosen to be a bit larger than your expected data. For example, if you expect 2M of items, then choose a prime around 3M.

Resources