How to serialize a Graph-like AVL tree to disk? - c

I know it sounds weird, but this is it. I have a data structure which is basically a modified AVL tree. Each node of the structure has a left child and a right child. These core pointers (left & right) link all data nodes together and keep the data structure balanced (AVL rotations) to improve searching. But those are not the only pointers in the structure: there are others that can point to any random node in the tree (which creates the graph-like analogy).
The tree is built at runtime through user interaction (CLI). The user is also responsible for creating all the different links between the nodes.
An example of such a data structure could be (Didn't start coding yet, it's only prototyping):
struct node {
    struct node *left;
    struct node *right;
    struct node *links[NUM]; // Points to any random node in the tree.
    /* Probably many other fields here that could be either pointers
       or other data types */
};
Now, everything is in RAM. Once the user wants to exit, all the data nodes (the whole tree) should be saved to a file in binary mode (for later reloading, so one must take this into consideration).
It is basically easy to save the AVL tree using one of the recursive tree traversal algorithms (in that case the question would be a duplicate, because solutions already exist on SO). But in my case, I have to preserve all the arbitrarily created links between the nodes.
What would be the most efficient way in time and space?

You could dump your data structure as is (including the pointer values) and, in the binary blob of each node, also store its address. When reloading the data structure, you dynamically allocate your nodes and store their new addresses in a hash table whose keys are the old addresses. In a final pass, you walk the hash table sequentially (not using the old addresses as keys), retrieve the new address of each node, and update its pointer fields from old addresses to new addresses, again using the hash table as a translation table (with the old addresses as keys).
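A minimal sketch of that address-translation idea, under several assumptions: the node layout is the question's prototype with a hypothetical `key` payload, `NUM` is set to 2, and the names `record`, `dump`, `reload`, `slot` and `translate` are made up for illustration. The records are kept in a memory array here; that array is what would be handed to `fwrite` and read back with `fread`. The second pass walks the record array rather than the hash table itself, which has the same effect.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM 2                     /* assumed number of extra links */

struct node {
    struct node *left, *right;
    struct node *links[NUM];
    int key;                      /* hypothetical payload */
};

/* One saved node: its old address plus its pointer fields as integers. */
struct record {
    uintptr_t self, left, right, links[NUM];
    int key;
};

/* Open-addressing table: old address -> new node. Key 0 marks an empty slot. */
struct map { uintptr_t *keys; struct node **vals; size_t cap; };

static size_t slot(const struct map *m, uintptr_t k) {
    size_t i = (size_t)((k >> 4) * 2654435761u) & (m->cap - 1);
    while (m->keys[i] != 0 && m->keys[i] != k)
        i = (i + 1) & (m->cap - 1);
    return i;
}

static struct node *translate(const struct map *m, uintptr_t old) {
    return old ? m->vals[slot(m, old)] : NULL;
}

/* Pre-order dump; returns the number of records written to out. */
static size_t dump(const struct node *n, struct record *out) {
    if (!n) return 0;
    out->self = (uintptr_t)n;
    out->left = (uintptr_t)n->left;
    out->right = (uintptr_t)n->right;
    for (int i = 0; i < NUM; i++) out->links[i] = (uintptr_t)n->links[i];
    out->key = n->key;
    size_t used = 1;
    used += dump(n->left, out + used);
    used += dump(n->right, out + used);
    return used;
}

/* Rebuild the structure; recs[0] is the root. Allocation checks are omitted. */
static struct node *reload(const struct record *recs, size_t count) {
    if (count == 0) return NULL;
    struct map m = { NULL, NULL, 1 };
    while (m.cap < 2 * count) m.cap *= 2;       /* power of two, half empty */
    m.keys = calloc(m.cap, sizeof *m.keys);
    m.vals = calloc(m.cap, sizeof *m.vals);
    for (size_t i = 0; i < count; i++) {        /* pass 1: allocate, fill table */
        size_t s = slot(&m, recs[i].self);
        m.keys[s] = recs[i].self;
        m.vals[s] = malloc(sizeof(struct node));
        m.vals[s]->key = recs[i].key;
    }
    for (size_t i = 0; i < count; i++) {        /* pass 2: old -> new addresses */
        struct node *n = translate(&m, recs[i].self);
        n->left = translate(&m, recs[i].left);
        n->right = translate(&m, recs[i].right);
        for (int j = 0; j < NUM; j++)
            n->links[j] = translate(&m, recs[i].links[j]);
    }
    struct node *root = translate(&m, recs[0].self);
    free(m.keys);
    free(m.vals);
    return root;
}
```

Storing the pointer fields as `uintptr_t` rather than as raw `struct node *` avoids reading stale pointer values after a reload; the file would only be portable between builds with the same pointer width and struct layout.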

Choose a unique index number for each node, and use it to serialize the links.
This will likely take two traversal passes -- one to set the index number, and one to do the serialization. Add an integer field to your node to hold the index number; you shouldn't need any other memory overhead.
Alternatively, if you manage your tree nodes by storing them in an array or std::vector, you will already have an index number handy, and you won't need an additional index field. Also, you can store all your links as indices instead of pointers, so you can just serialize your container as-is.
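Here is a rough sketch of the two-pass index approach in C, assuming the question's node layout with `NUM` set to 2, a hypothetical `key` payload, and an added `index` field. The names `disk_node`, `number_nodes`, `emit` and `rebuild` are invented for this example, and -1 is an assumed encoding for a null link.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM 2                      /* assumed number of extra links */

struct node {
    struct node *left, *right;
    struct node *links[NUM];
    int32_t index;                 /* added field: the node's serial number */
    int key;                       /* hypothetical payload */
};

/* What goes to disk: every pointer replaced by an index, -1 for NULL. */
struct disk_node {
    int32_t left, right, links[NUM];
    int key;
};

/* Pass 1: pre-order numbering. Returns the next free index (the node count). */
static int32_t number_nodes(struct node *n, int32_t next) {
    if (!n) return next;
    n->index = next++;
    next = number_nodes(n->left, next);
    return number_nodes(n->right, next);
}

static int32_t index_of(const struct node *n) { return n ? n->index : -1; }

/* Pass 2: fill out[i] for the node numbered i. */
static void emit(const struct node *n, struct disk_node *out) {
    if (!n) return;
    struct disk_node *d = &out[n->index];
    d->left = index_of(n->left);
    d->right = index_of(n->right);
    for (int i = 0; i < NUM; i++) d->links[i] = index_of(n->links[i]);
    d->key = n->key;
    emit(n->left, out);
    emit(n->right, out);
}

static struct node *at(struct node *pool, int32_t i) {
    return i < 0 ? NULL : &pool[i];
}

/* Loading needs no lookup table: index i simply becomes &pool[i].
 * The whole tree lives in one allocation, so free(root) releases it all,
 * but individual nodes cannot be freed separately. */
static struct node *rebuild(const struct disk_node *in, int32_t count) {
    if (count <= 0) return NULL;
    struct node *pool = calloc((size_t)count, sizeof *pool);
    if (!pool) return NULL;
    for (int32_t i = 0; i < count; i++) {
        pool[i].index = i;
        pool[i].key = in[i].key;
        pool[i].left = at(pool, in[i].left);
        pool[i].right = at(pool, in[i].right);
        for (int j = 0; j < NUM; j++)
            pool[i].links[j] = at(pool, in[i].links[j]);
    }
    return pool;                   /* index 0 is the root */
}
```

The `disk_node` array can be written and read back in one call each; the single-pool allocation in `rebuild` is a simplification and would need rethinking if nodes are deleted one by one after loading.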

Related

Representation of Tree Data Structures

I've studied that Data Structures can be classified into Linear (arrays, stacks, queues and linked-lists) and Non-Linear (trees and graphs) data structures.
flowchart of data structures -- image source: medium
Now, my question is that if linked-lists are "linear" data structures, then why are they used to implement trees, which are non-linear? Since trees have nodes that consist of the keys (or values) and pointers to more child nodes, aren't trees just more complex linked-lists?
If they are, then how is the statement "linked-lists are linear data structures" justified?
This is how I actually got this question:
I am currently learning data structures in C. So in order to implement the Tree data structure, I use structs, which consist of keys and pointers to the left and right children of each node.
typedef struct Node
{
int key;
struct Node *left;
struct Node *right;
} Node;
Then I wondered whether I'm essentially implementing a linked list (since linked lists are also built from structs in C).
You are confusing different things.
A tree is a logical data structure: a special case of a directed graph without cycles.
A tree data structure can be implemented in various ways: an array that stores the indices of children, actual pointers to the children's memory (as in your example), and other ways. Your struct is one specific implementation of a tree data structure, but there are other implementations.
A 'linked list' typically refers to a specific implementation where elements point to each other's memory. There is the singly linked list, where each element points to the next one, and the doubly linked list, where each element points to both the previous and the next one.
If you implement trees with pointers and each node has only one child, then this is a special case where the implementation resembles a linked list.
Note that a linked list may have a loop, while a tree never has a loop, because then it becomes a general graph (by definition).
Also, it is not common for a tree node to point back to its parent, only to its children, while linked-list elements sometimes point back to the previous element.
So a linked list is an implementation of the logical data structure called a 'list'. This implementation uses pointers.
A list can be implemented in other ways (arrays, histograms, hash tables with counters of the number of appearances of each element, skip lists for O(log(n)) search, etc.).
A tree is a logical data structure which is commonly implemented using pointers, but it has other implementations as well.
When a tree is implemented using pointers, each single branch in the tree resembles a linked list, so this is the source of your confusion.
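To make the "one child per node" special case concrete, here is a small sketch using the question's `Node` struct. The helper names `insert`, `height` and `chain_length` are made up for illustration, and `insert` is assumed to be a plain, unbalanced BST insertion. Inserting keys in ascending order gives every node only a right child, and the tree can then be walked like a singly linked list.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node {
    int key;
    struct Node *left;
    struct Node *right;
} Node;

/* Plain (unbalanced) BST insert; returns the possibly new subtree root. */
static Node *insert(Node *root, int key) {
    if (root == NULL) {
        Node *n = malloc(sizeof *n);
        if (n == NULL) abort();
        n->key = key;
        n->left = n->right = NULL;
        return n;
    }
    if (key < root->key)
        root->left = insert(root->left, key);
    else
        root->right = insert(root->right, key);
    return root;
}

/* Number of nodes on the longest root-to-leaf path. */
static int height(const Node *n) {
    if (n == NULL) return 0;
    int l = height(n->left), r = height(n->right);
    return 1 + (l > r ? l : r);
}

/* Follow only the right pointers, exactly like following `next` in a list. */
static int chain_length(const Node *n) {
    int len = 0;
    for (; n != NULL; n = n->right)
        len++;
    return len;
}
```

With keys 1 to 5 inserted in order, `height` and `chain_length` agree, because the tree has degenerated into a chain; with a branching tree they differ, since the list-style walk only ever sees one branch.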

B Tree with varying maximum keys?

I am wanting to implement something along the lines of a btree to index some data using variable length keys, where I am expecting that each node in the tree will look something like this,
struct key_block {
block_ptr parent; // link back up the tree to the parent
unsigned numkeys; // number of keys currently used by this block
struct {
block_ptr child; // points to the child immediately preceding this key.
struct {
unsigned length; // how long this key is
unsigned offset; // where the data for this key is
} key; // support for variable length keys
data_ptr content; // ptr to the data indexed by this key
} entries[]; // as many entries as will fit on a disk block.
}; // the last entry will be followed by another block_ptr which is the right hand child of the last node.
What I intend to do is store the actual key data in the same disk block as the node itself, positioned just after the final key and its right-hand child within the node. The offset and length fields in each key will indicate how far from the start of the current block the actual data for each key begins and how long it runs.
However, I am wanting to use a fixed size of disk block for my storage, and since I am wanting to store the variable length keys inside of the same block as the node, that means that the maximum number of keys that can be in a node depends on the length of the keys in that node. This sort of contradicts my understanding of the way a btree generally works, where all nodes have a fixed maximum number of entries, and I am not sure whether I can really implement this using a btree at all because I am violating that typical invariant.
So should I even be looking at using a btree structure? If not, what other alternatives exist for very fast external searching, insertion, and deletion? In particular, a key criterion for any solution is that it must scale to very large numbers of entries and still be efficient for searching, insertion, and deletion (and btrees perform adequately on this front).
If I can still use a btree, how would the algorithm be affected when I no longer have an invariant maximum number of keys, but instead the maximum depends on the content of each individual node itself?
There is no fundamental issue with a variable maximum number of keys in a B-tree. However, a B-tree does depend on some minimum and maximum number of keys in each node. If you have a fixed maximum number of keys per node, then this is easy (usually N/2 to N keys). Because you allow a variable number, you will need to determine a heuristic for balancing the tree. The better the heuristic, the better the performance.
Fortunately, the issue will merely be performance. The shape of the B-tree has several invariants, but none of them are affected by your variable number of keys, so you will still be able to search. It just might be a poorly-balancing structure if you choose a bad heuristic.
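One possible heuristic, offered purely as an assumption and not something the answer prescribes, is to measure fullness in bytes rather than in key count: a node overflows when its entries no longer fit in the block, counts as underfull below half a block, and is split where the byte totals are most even. In this sketch `BLOCK_SIZE`, `HEADER_SIZE` and `ENTRY_FIXED` are invented constants, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* All three sizes are assumptions chosen for illustration only. */
enum { BLOCK_SIZE = 4096, HEADER_SIZE = 16, ENTRY_FIXED = 16 };

/* Bytes one entry occupies: fixed fields plus the key bytes stored in the block. */
static size_t entry_cost(size_t key_len) {
    return ENTRY_FIXED + key_len;
}

/* Bytes used by a node holding n keys with the given lengths. */
static size_t used_bytes(const size_t *key_len, size_t n) {
    size_t total = HEADER_SIZE;
    for (size_t i = 0; i < n; i++)
        total += entry_cost(key_len[i]);
    return total;
}

/* Replaces "more than N keys". */
static int overflows(const size_t *key_len, size_t n) {
    return used_bytes(key_len, n) > BLOCK_SIZE;
}

/* Replaces "fewer than N/2 keys": candidate for merging or borrowing. */
static int underfull(const size_t *key_len, size_t n) {
    return used_bytes(key_len, n) < BLOCK_SIZE / 2;
}

/* Index of the entry to promote when splitting: entries before it stay in
 * the left node, entries after it go to the right node. The index is kept
 * in [1, n - 2] so that neither side ends up empty. Requires n >= 3. */
static size_t split_point(const size_t *key_len, size_t n) {
    assert(n >= 3);
    size_t total = used_bytes(key_len, n) - HEADER_SIZE;
    size_t left = entry_cost(key_len[0]);
    for (size_t i = 1; i + 2 < n; i++) {
        if (2 * (left + entry_cost(key_len[i])) > total)
            return i;
        left += entry_cost(key_len[i]);
    }
    return n - 2;
}
```

With equal-length keys this reduces to the usual middle split; with one very long key, the split moves toward that key so both halves hold a similar number of bytes.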

binary search tree index implementation using symbol tables

I am reading about index implementation using symbol tables in the book Algorithms in C++ by Robert Sedgewick.
Below is a snippet from the book:
We can adapt binary search trees to build indices in precisely the
same manner as we provided indirection for sorting and for heaps.
Arrange for keys to be extracted from items via the key member
function, as usual. Moreover, we can use parallel arrays for the
links, as we did for linked lists. We use three arrays, one each for
the items, left links, and right links. The links are array indices
(integers), and we replace link references such as
x = x->l
in all our code with array references such as
x = l[x].
This approach avoids the cost of dynamic memory allocation for each
node -- the items occupy an array without regard to the search function,
and we preallocate two integers per item to hold the tree links,
recognizing that we will need at least this amount of space when all
the items are in the search structure. The space for the links is not
always in use, but it is there for use by the search routine without
any time overhead for allocation. Another important feature of this
approach is that it allows extra arrays (extra information associated
with each node) to be added without the tree-manipulation code being
changed at all. When the search routine returns the index for an item,
it gives a way to access immediately all the information associated
with that item, by using the index to access an appropriate array.
This way of implementing BSTs to aid in searching large arrays of
items is sometimes useful, because it avoids the extra expense of
copying items into the internal representation of the ADT, and the
overhead of allocation and construction by new. The use of arrays is
not appropriate when space is at a premium and the symbol table grows
and shrinks markedly, particularly if it is difficult to estimate the
maximum size of the symbol table in advance. If no accurate size
prediction is possible, unused links might waste space in the item
array.
My questions on the above text are:
What does the author mean by "we can use parallel arrays for the links, as we did for linked lists"? What does this statement mean, and what are parallel arrays?
What does the author mean by saying that the links are array indices and that we replace link references such as x = x->l with x = l[x]?
What does the author mean by "Another important feature of this approach is that it allows extra arrays (extra information associated with each node) to be added without the tree-manipulation code being changed at all."?
You appear to have edited the text to take out the useful references. Either that or you have an earlier version of the text.
My third edition states that the index builds are covered in section 9.6, where it covers the process, and the parallel arrays are explained in chapter 3. The parallel arrays are simply storing the payload (the keys and possibly data that are held in the tree) and left/right pointers in three or more separate arrays, using the index to tie them together (x = left[x]). In that case, you may end up with something like:
int leftptr[100];
int rightptr[100];
char *payload[100];
and so on. In that example, node #74 would have its data stored in payload[74], and the left and right "pointers" (actually indexes) stored in leftptr[74] and rightptr[74] respectively.
This is in contrast to having a single array of structures with the structure holding payload and pointers together (x = x->left;):
struct sNode {
    struct sNode *left, *right;
    char payload[];
};
So, for your specific questions:
Parallel arrays are simply separating the tree structure information from the payload information and using the index to tie together information from those arrays.
Since you're using arrays for the links (and these arrays now hold array indexes rather than pointers), you no longer use x = x->left to move to the left child. Instead you use x = left[x].
The tree manipulation is only interested in the links. By having the links separated from the payload (and other possibly useful information), the code for manipulating tree structure can be simpler.
If you haven't already, you should flip back in the book to the section on linked-lists where he says the technique was used previously (it's probably explained there).
Parallel arrays means we don't have a struct to hold the node information.
struct node {
int data;
struct node *left;
struct node *right;
};
Instead, we have arrays.
int data[SIZE];
int left[SIZE];
int right[SIZE];
These are parallel arrays because we will use the same index to access the data and links. The node is represented in our code by an index, not a pointer. So for node 4, the data is at
data[4];
The left link is at
left[4];
Adding more information at the node can be done by creating yet another array of the same size.
int extra[SIZE];
The extra data for node 4 will be at
extra[4];
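Putting those arrays to work, a minimal sketch of BST insertion and search over parallel arrays might look like the following. The names `NIL`, `root`, `count`, `bst_insert` and `bst_search` are assumptions for this example (the book's own code is not reproduced here), and `SIZE` is set to an arbitrary 100.

```c
#include <assert.h>
#include <stddef.h>

#define SIZE 100
#define NIL  (-1)                  /* assumed marker for "no child" */

static int data[SIZE];             /* keys */
static int left[SIZE];             /* index of left child, or NIL */
static int right[SIZE];            /* index of right child, or NIL */
static const char *extra[SIZE];    /* extra per-node info; tree code ignores it */

static int root = NIL;
static int count = 0;              /* number of slots in use */

/* Insert a key; returns the index of the new node, or NIL if the arrays are full. */
static int bst_insert(int key) {
    if (count == SIZE) return NIL;
    int x = count++;
    data[x] = key;
    left[x] = right[x] = NIL;
    if (root == NIL) {
        root = x;
        return x;
    }
    int p = root;
    for (;;) {
        int *link = key < data[p] ? &left[p] : &right[p];
        if (*link == NIL) {
            *link = x;
            return x;
        }
        p = *link;                 /* the array form of p = p->left / p->right */
    }
}

/* Returns the index of the node holding key, or NIL if absent. */
static int bst_search(int key) {
    int x = root;
    while (x != NIL && data[x] != key)
        x = key < data[x] ? left[x] : right[x];
    return x;
}
```

The index returned by `bst_search` can be used directly on any other parallel array, for instance `extra[bst_search(70)]`, which is the point the book makes about adding arrays without touching the tree code.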

Flatten a Tree into an Array

I am looking for the best way to place a tree into an array
The idea is to follow this principle : Array Implementation of Trees
but I'm stuck on how to know which nodes are the children and which nodes are at the same level, because I'm not using a binary tree.
I might have to store ASCII, but I can't simply allow arrays of 256 pointers!
Any idea would be welcome.
The purpose of this is to send an array (tree) to my GPU instead of using structures.
Well, here is my idea for converting a tree into an array.
Take an array of size MAX_VAL, which is the total number of nodes in the tree. The type of the array elements should be the same as that of a node, but with one extra field: the index of its parent. You store each node in this way. Store the root node at the first position, say 1. The child nodes of this node are then stored subsequently, with the extra field storing 1 (since this is where the root was stored).
Apply this procedure to all nodes and you are done. You can get the tree back with a simple recursive call on each node.
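A rough C sketch of this parent-index layout, under some assumptions of mine: the source tree is stored as first-child/next-sibling pointers, the root goes at index 0 (rather than 1) with parent -1, and nodes are written in breadth-first order so that siblings end up next to each other. The type and function names (`tnode`, `flat`, `flatten`, `depth`) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* General (non-binary) tree: first child plus next sibling. */
struct tnode {
    char value;
    struct tnode *child;           /* first child, or NULL */
    struct tnode *sibling;         /* next sibling, or NULL */
};

/* Array element: the node's payload plus the index of its parent. */
struct flat {
    char value;
    int parent;                    /* -1 for the root */
};

/* Breadth-first flatten into out[0..max_val-1]; returns the number of nodes
 * stored. Nodes beyond max_val are silently dropped. */
static int flatten(const struct tnode *root, struct flat *out, int max_val) {
    if (root == NULL || max_val <= 0) return 0;
    const struct tnode **queue = malloc((size_t)max_val * sizeof *queue);
    if (queue == NULL) return 0;
    int head = 0, tail = 0;
    queue[tail] = root;
    out[tail].value = root->value;
    out[tail].parent = -1;
    tail++;
    while (head < tail) {
        /* Append all children of the node at position head. */
        for (const struct tnode *c = queue[head]->child;
             c != NULL && tail < max_val; c = c->sibling) {
            queue[tail] = c;
            out[tail].value = c->value;
            out[tail].parent = head;
            tail++;
        }
        head++;
    }
    free(queue);
    return tail;
}

/* Level of node i: follow parent indices up to the root. */
static int depth(const struct flat *a, int i) {
    int d = 0;
    while (a[i].parent != -1) {
        i = a[i].parent;
        d++;
    }
    return d;
}
```

In this layout two nodes are siblings when their `parent` fields are equal, and two nodes are on the same level when `depth` returns the same value, which covers the "which nodes are children, which are on the same level" part of the question.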
Hope this helps. :) :)
Ahnentafel lists are very big if the tree is not near-perfectly balanced. My guess is your tree isn't going to be balanced, so the cost will probably outweigh the benefit of implicit parent/child pointers. I've never seen a non-binary Ahnentafel list, but I assume it's possible (were you asking for the implicit equations?).
Could you keep a sorted list of child pointers for each node (ASCII character + pointer/index)? In this case it might be best, as others suggest, to construct the tree using pointers and allow the children to grow. Then pack all the nodes into a list: work out an order to place the nodes, use prefix sums for their offsets into the array, store the position indices on each node and finally copy the children lists into the array (replacing the children pointers with list indices can be done by following the pointers and querying the index from the previous step).
Traversing to a child in CUDA won't be constant time, but since the order is known you can use a binary search to speed things up.
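The packing steps could be sketched roughly as follows, in plain C rather than CUDA. Everything named here (`pnode`, `pack`, `find_child`, the `offset`/`label`/`target` arrays) is an assumption for illustration; the caller is assumed to supply the nodes in whatever order was chosen, with each node's child labels already sorted.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer-based node with a sorted child list (label + pointer per child). */
struct pnode {
    int nchild;
    const unsigned char *label;    /* child characters, ascending */
    struct pnode **kid;            /* kid[k] is reached via label[k] */
    int index;                     /* position assigned during packing */
};

/* Pack n nodes given in their chosen order.
 * offset must hold n + 1 ints; label and target must hold the total number
 * of children. Children of node i end up in [offset[i], offset[i + 1]). */
static void pack(struct pnode **order, int n,
                 int *offset, unsigned char *label, int *target) {
    offset[0] = 0;
    for (int i = 0; i < n; i++) {
        order[i]->index = i;                              /* store position */
        offset[i + 1] = offset[i] + order[i]->nchild;     /* prefix sum */
    }
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < order[i]->nchild; k++) {
            label[offset[i] + k] = order[i]->label[k];
            target[offset[i] + k] = order[i]->kid[k]->index;  /* pointer -> index */
        }
    }
}

/* Binary search among the children of node for character ch.
 * Returns the child's node index, or -1 if there is no such child. */
static int find_child(const int *offset, const unsigned char *label,
                      const int *target, int node, unsigned char ch) {
    int lo = offset[node], hi = offset[node + 1] - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (label[mid] == ch) return target[mid];
        if (label[mid] < ch) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}
```

The three flat arrays are the only things that would need to be copied to the GPU; `find_child` uses nothing but array reads and integer arithmetic, so the same logic should carry over to a device function, although that port is not shown or tested here.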

Is there strong reason not to include multiple node pointers in a node to use in more than one data structure?

Take for example the assignment I'm working on. We're to use a binary search tree for one piece of a set of data and then a linked list for another piece in the set. The suggested method by the professor was:
struct treeNode
{
data * item;
treeNode *left, *right;
};
struct listNode
{
data * item;
listNode *next, *prev;
};
class collection
{
public:
........
};
Where data is a class containing the particulars of each record. Obviously as it's set up, a treeNode can't exist in the linked list.
Wouldn't it be much simpler to:
struct node
{
data * item;
node *listNext, *listPrev, *treeLeft, *treeRight;
};
then we can declare:
node * listHead;
node * treeRoot;
and include both insertion algorithms into the class.
Is there something I'm missing?
Actually, the data items are to be inserted into both lists. The (mundane) purpose of the assignment is to sort the data set by two different elements in the set.
So with that said, wouldn't I be saving memory? Combining the 2 nodes, I end up with 5 pointers; if I left them separate, I'd be using 6. Also, I really only have one group of data this way. If I had 250 data items to keep track of, I'd have one group of 1250 pointers instead of 2 lists of 750. Maybe I'm misunderstanding what actually gets allocated with pointer calls.
You can do that, but you are wasting memory with the extra pointers. Also, it tends to be more confusing to mix types like that. Am I correct in assuming that the data is either put into the list or put into the tree, but not inserted into both? There's really not much reason to have them both use the same structure if they are different data types anyway. If you are inserting the same data into both types, you could potentially switch from traversing the tree to traversing the list if you had any use for such an action.
Since you're inserting the data into both lists, it would save memory to use your composite node structure. I would insert into the binary tree first, then insert the allocated node into the linked list. You wouldn't really end up with a pure linked list or a binary search tree, but it could be traversed like either one.
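As a sketch of that composite-node insertion, written in C rather than the assignment's C++: the `data` record with `id` and `score` fields is a hypothetical stand-in for the real class, and the choice to order the tree by `id` and the list by `score` is an assumption made only to show two different sort keys. `collection_insert` is an invented name.

```c
#include <assert.h>
#include <stdlib.h>

struct data {                      /* hypothetical record with two sort keys */
    int id;
    int score;
};

struct node {
    struct data *item;
    struct node *listNext, *listPrev, *treeLeft, *treeRight;
};

struct collection {
    struct node *listHead;         /* list kept sorted by score */
    struct node *treeRoot;         /* BST ordered by id */
};

/* Allocate one node and link it into the tree first, then into the list. */
static struct node *collection_insert(struct collection *c, struct data *item) {
    struct node *n = calloc(1, sizeof *n);
    if (n == NULL) return NULL;
    n->item = item;

    /* Tree insertion: walk down to the empty link where the node belongs. */
    struct node **link = &c->treeRoot;
    while (*link != NULL)
        link = item->id < (*link)->item->id ? &(*link)->treeLeft
                                            : &(*link)->treeRight;
    *link = n;

    /* Sorted doubly linked list insertion, using the same node. */
    struct node *prev = NULL, *cur = c->listHead;
    while (cur != NULL && cur->item->score <= item->score) {
        prev = cur;
        cur = cur->listNext;
    }
    n->listPrev = prev;
    n->listNext = cur;
    if (cur != NULL) cur->listPrev = n;
    if (prev != NULL) prev->listNext = n;
    else c->listHead = n;
    return n;
}
```

Each record costs one allocation holding five pointers, and a node found through the tree can be followed along the list immediately, which is the cross-traversal mentioned above.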
What was the answer?
If your data is less than (hmmm) megabytes, don't worry about memory consumption. 1 or 2 Gigabytes is typical in normal computers today.
How big are the items? 32 char? 64k of compressed multimedia? Something big?
How reasonable is it to organize one item using both techniques? If the data are really the same, then a 5-pointer structure is interesting: someone could find a node in one ordering and then browse related nodes in the other ordering.
Are the items unrelated, some chalk, some cheese? Are they multidimensional? personnel records? Audio file descriptions? Recipes?
In school, a good teacher is trying to give you experience with common techniques and disciplines. Just like art class, or composition. Pencil, pastels, 5 paragraph essay. So the teacher might want you to write two different classes & constructors. Use one struct for one part of the data, different one for other data. Or the same. Just because.
Outside of school, the data comes in a format and there are operations desired on it/with it. "Use cases" are stories about how data is used, what has to be kept, what algorithms are used.
The point of this might be bimodal searching, 2 pairs of orthogonal pointers. It might be unions, where each item is associated with a list or a tree, but not both at the same time. It might be a flurry of lightweight subsets, trees and lists, that are compared and contrasted...
When in doubt, "data structures + algorithms = programs". But it pays to know what point the teacher is trying to make, and whether you want to follow their lead. (Usually, in school, you do.)
