Is it worth representing a red-black tree as an array to eliminate the memory overhead? Or will the array take up more memory, since it will have empty slots?
It will have both positive and negative sides. This answer applies to C [since you mentioned that is what you will use].
Positive sides
Let's assume you have created an array as a pool of objects to use for the red-black tree. Deleting an element, or initializing a new one once its position is found, will be a little faster, because you will be using a memory pool you manage yourself.
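A minimal sketch of such a pool in C (the fixed capacity, the field names, and the trick of threading the free list through unused slots are all illustrative assumptions, not from the question):

```c
#include <stddef.h>

#define POOL_CAP 1024  /* assumed maximum number of tree nodes */

struct rb_node {
    int key;
    int left, right;   /* indices into the pool, -1 = nil */
    int red;           /* 1 = red, 0 = black */
};

static struct rb_node pool[POOL_CAP];
static int free_head = -1;   /* head of the free list */

/* Thread a free list through the unused slots once at startup. */
static void pool_init(void) {
    for (int i = 0; i < POOL_CAP - 1; i++)
        pool[i].left = i + 1;        /* reuse 'left' as the free-list link */
    pool[POOL_CAP - 1].left = -1;
    free_head = 0;
}

/* O(1) allocation: pop a slot off the free list. Returns -1 when full. */
static int node_alloc(void) {
    int i = free_head;
    if (i != -1)
        free_head = pool[i].left;
    return i;
}

/* O(1) deallocation: push the slot back onto the free list. */
static void node_free(int i) {
    pool[i].left = free_head;
    free_head = i;
}
```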
Negative sides
Yes, the array will most probably end up taking more memory, since it will have empty slots some of the time.
You have to be sure about the MAX size of the red-black tree in this case, so there is a size limitation.
You are not really getting the benefit of the sequential memory space either, since tree operations jump around the array, so that might be a waste of the resource.
Yes, you can represent a red-black tree as an array, but it's not worth it.
The maximum height of a red-black tree is 2*log2(n+1), where n is the number of entries. In the array representation, level h of the tree holds 2^h entries, so the array needs roughly 2^height slots in total. So to store 1,000 entries (height up to 20) you'd have to allocate an array of 1,048,576 entries; to store 1,000,000 entries (height up to 40), an array of 1,099,511,627,776 entries.
It's not worth it.
A red-black tree (like most data structures, really) doesn't care which storage facility is used. That means you can use an array, or even a HashTable/Map, to store the tree nodes; the array index or map key becomes your new "pointer". You could even put the tree on disk as a file and use the file offset as the node index if you wanted to (though in that case you should use a B-Tree instead).
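For illustration, a minimal sketch of what "index as pointer" looks like in C (the struct and field names are invented):

```c
#include <stdint.h>

#define NIL (-1)  /* the index equivalent of a NULL pointer */

struct node {
    int32_t key;
    int32_t left;    /* index of left child in the backing array, or NIL */
    int32_t right;   /* index of right child, or NIL */
    int32_t parent;  /* index of parent, or NIL */
};

/* Node "pointers" are just indices into the backing array. The same
   layout works whether 'nodes' lives in RAM, in a mmap'ed file, or is
   serialized to disk. */
struct tree {
    struct node *nodes;  /* backing storage */
    int32_t      root;   /* index of the root, or NIL */
};
```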
The main problem is increased complexity: you now have to manage the storage manually (as opposed to letting the OS and/or language runtime do it for you). Sometimes you want to scale the array up so you can store more items; sometimes you want to scale it down (vacuum) to free up unused space. These operations can be costly on their own.
Memory-usage wise, the storage facility does not change how many nodes the tree has. If you have 2,000 nodes in your old-school pointer-based tree (tree height = 10), you'll still have 2,000 nodes in your fancy array-backed tree (tree height is still 10). However, redundant space may exist in between vacuum operations.
The default way to implement a dynamic array is to use realloc: once len == capacity, we call realloc to grow the array. This can cause the whole array to be copied to another heap location. I don't want this copying to happen, since I'm designing a dynamic array that should be able to store a large number of elements, and the system that will run this code won't be able to handle such a heavy operation.
Is there a way to achieve that?
I'm fine with losing some performance (O(log N) for access instead of O(1) is okay). I was thinking I could use a hashtable for this, but it looks like I'm in a deadlock, since in order to implement such a hashtable I would need a dynamic array in the first place.
Thanks!
Not really, not in the general case.
The copy happens when the memory manager can't grow the current allocation in place and needs to move the memory block somewhere else.
One thing you can try is to allocate fixed-size blocks and keep a dynamic array of pointers to the blocks. This way the blocks never need to be reallocated, keeping the large payloads in place. If you do need to reallocate, you only reallocate the array of references, which is much cheaper (moving 8-byte pointers instead of 1 MB or more of payload). In the ideal case the block size is about sqrt(N), so this doesn't work well in the fully general case (any fixed block size will be too large or too small for some values of N).
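A rough sketch of that scheme in C, assuming a fixed block size (all names here are illustrative):

```c
#include <stdlib.h>

#define BLOCK_SIZE 4096  /* assumed fixed number of elements per block */

struct blocked_array {
    double **blocks;   /* small, cheap-to-realloc array of block pointers */
    size_t   nblocks;
    size_t   len;      /* total number of elements stored */
};

/* Element i lives in block i / BLOCK_SIZE at offset i % BLOCK_SIZE: O(1). */
static double *ba_at(struct blocked_array *ba, size_t i) {
    return &ba->blocks[i / BLOCK_SIZE][i % BLOCK_SIZE];
}

/* Growing only reallocates the pointer table; existing blocks stay put. */
static int ba_push(struct blocked_array *ba, double v) {
    if (ba->len == ba->nblocks * BLOCK_SIZE) {
        double **t = realloc(ba->blocks, (ba->nblocks + 1) * sizeof *t);
        if (!t) return -1;
        ba->blocks = t;
        ba->blocks[ba->nblocks] = malloc(BLOCK_SIZE * sizeof(double));
        if (!ba->blocks[ba->nblocks]) return -1;
        ba->nblocks++;
    }
    *ba_at(ba, ba->len) = v;
    ba->len++;
    return 0;
}
```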
If you are not against small allocations, you could use a singly linked list of tables, where each new table doubles the capacity and becomes the front of the list.
If you want to access the last element, you can just get the value from the last allocated block, which is O(1).
If you want to access the first element, you have to run through the list of allocated blocks to get to the correct block. Since the length of each block is two times the previous one, this means the access complexity is O(logN).
This data structure relies on the same principle as dynamic arrays (doubling the size when expanding), but instead of copying the values after allocating a new block, it keeps track of the previous block, meaning that accessing earlier blocks adds overhead, but accessing the most recent ones does not.
The index is not a position in a specific block, but in an imaginary concatenation of all the blocks, starting from the first allocated block.
Thus, this data structure cannot be implemented as a recursive type, because it needs a wrapper keeping track of the total capacity to compute which block an index refers to.
For example:
There are three blocks, of sizes 100, 200, 400.
Accessing the 150th value (index 149 if starting from 0) means accessing the 50th value of the second block. The interface needs to know the total length is 700 and compare the index to 700 - 400 = 300 to determine whether the index refers to the last block (if it is 300 or above) or an earlier one.
Then the interface compares the index with the capacity before the second block (300 - 200 = 100) and, since 149 >= 100, knows the value is in the second block.
This algorithm can have as many iterations as there are blocks, which is O(logN).
Again, if you only try to access the last value, the complexity becomes O(1).
If you have concerns about copy times in real-time applications or with large amounts of data, this data structure could be better than contiguous storage, which in some cases has to copy all of your data.
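For concreteness, here is a minimal, hedged C sketch of the lookup described above (identifiers are invented for illustration; bounds are not checked):

```c
#include <stddef.h>

/* Newest (largest) block sits at the head of the list; each block is
   twice the size of the one allocated before it. */
struct block {
    struct block *prev;   /* the next-older (half-sized) block */
    size_t        size;   /* capacity of this block */
    int           data[]; /* flexible array member holding the values */
};

struct seglist {
    struct block *head;     /* most recently allocated block */
    size_t        capacity; /* sum of all block sizes */
};

/* Index i into the imaginary concatenation of all blocks, oldest first.
   Precondition: i < sl->capacity. Walks at most one block per doubling,
   i.e. O(log N); the last block is found immediately, i.e. O(1). */
static int *sl_at(struct seglist *sl, size_t i) {
    struct block *b = sl->head;
    size_t before = sl->capacity - b->size;  /* elements in older blocks */
    while (i < before) {                     /* not in this block: go older */
        b = b->prev;
        before -= b->size;
    }
    return &b->data[i - before];
}
```

Running the worked example through this code: with blocks of 100, 200, and 400, index 149 starts with before = 700 - 400 = 300; since 149 < 300 we step to the 200-block (before = 100), and since 149 >= 100 we return data[49] of the second block, as described.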
I ended up with the following:
Implement "small dynamic array" that can grow, but only up to some maximum capacity (e.g. 4096 words).
Implement an rbtree
Combine them together to make a "big hash map", where "small array" is used as a table and a bunch of rbtrees are used as buckets.
Use this hashmap as a base for a "big dynamic array", using indexes as keys
While the capacity is less than the maximum, the table grows according to the load factor. Once the capacity has reached the maximum, the table won't grow any more, and new elements are simply inserted into the buckets. In theory this structure should work with O(log(N/k)) complexity, where k is the maximum table capacity.
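As a heavily hedged illustration of how the pieces fit together (the rbtree API from steps 2-3 is only assumed and declared here, not shown, and all names are invented):

```c
#include <stddef.h>

#define MAX_TABLE 4096  /* cap on the "small dynamic array" used as the table */

struct rbtree;  /* the rbtree from step 2 (assumed, not implemented here) */
extern void *rbtree_find(struct rbtree *t, size_t key);   /* assumed API */
extern int   rbtree_insert(struct rbtree *t, size_t key, void *val);

struct big_map {
    struct rbtree **buckets;  /* small dynamic array of rbtree buckets */
    size_t          nbuckets; /* grows up to MAX_TABLE, then stays fixed */
};

/* With the table capped at k = MAX_TABLE buckets, each bucket holds about
   N/k keys, so a lookup costs O(log(N/k)). Assumes nbuckets > 0. */
static void *big_map_get(struct big_map *m, size_t index) {
    return rbtree_find(m->buckets[index % m->nbuckets], index);
}
```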
In C, which is more efficient in terms of memory management, a linked list or an array?
For my program, I could use one or both of them. I would like to take this point into consideration before starting.
Both linked lists and arrays have good and bad sides.
Array
Accessing a particular position takes O(1) time, because the memory of an array is contiguous. So if the address of the first element is A, the address of the 5th element is A + 4 * sizeof(element).
If you want to insert a number at some position, it takes O(n) time, because you have to shift every number after that position and also increase the size of the array.
About searching for an element: if the array is sorted, you can do a binary search, and accessing each probed position is O(1), so the search takes O(log n). If the array is not sorted, you have to traverse the entire array, so O(n) time.
Deletion is the exact opposite of insertion: you have to left-shift all the numbers starting from the place where you deleted. You might also need to recreate the array for memory efficiency. So O(n).
Memory must be contiguous, which can be a problem on old x86 machines with 64k segments.
Freeing is a single operation.
Linked list
Accessing a particular position takes O(n) time, because you have to traverse the list to get to that position.
If you want to insert a number at some position and you already have a pointer to that position, it takes O(1) time to insert the new value.
About searching for an element: no matter how the numbers are arranged, you have to traverse the list from front to back, one node at a time, to find your number. So it's always O(n).
Deletion is the exact opposite of insertion. If you already know the position via some pointer (suppose the list is p -> q -> r and you want to delete q), all you need to do is set the next of p to r, and nothing else. So O(1) [given you know the pointer to p]; see the sketch after this list.
Memory is dispersed. With a naive implementation, that can be bad for cache locality, and the overall cost can be high because the memory allocation system has overhead for each node. However, careful programming can get around this problem.
Freeing requires a separate call for each node; however, again, careful programming can get around this problem.
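A minimal C sketch of the p -> q -> r deletion just described (the node layout is an assumption for illustration):

```c
#include <stdlib.h>

struct node {
    int          value;
    struct node *next;
};

/* Given a pointer to p in the chain p -> q -> r, unlinking q is O(1):
   no shifting of data, just one pointer update. */
static void delete_after(struct node *p) {
    struct node *q = p->next;
    if (q != NULL) {
        p->next = q->next;  /* p now points directly at r */
        free(q);
    }
}
```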
So depending on what kind of problem you are solving you have to choose one of the two.
A linked list uses more memory, both for the linked list itself and inside the memory manager, due to the fact that you are allocating many individual blocks of memory.
That does not mean it is less efficient at all, depending on what you are doing.
While a linked list uses more memory, adding or removing elements from it is very efficient, as it doesn't require moving data around at all, while resizing a dynamic array means you have to allocate a whole new area in memory to fit the modified array with items added/removed. You can also sort a linked list without moving its data.
On the other hand, arrays can be substantially faster to iterate due to caching, branch prediction, etc., as the data is placed sequentially in memory.
Which one is better for you will really depend on the application.
I would like to write a piece of code for inserting a number into a sorted array at the appropriate position (i.e. the array should still remain sorted after insertion)
My data structure doesn't allow duplicates.
I am planning to do something like this:
Find the right index where I should be putting this element using binary search
Create space for this element, by moving all the elements from that index down.
Put this element there.
Is there any other better way?
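For concreteness, a minimal C sketch of those three steps (all identifiers are illustrative, and the array is assumed to have spare capacity):

```c
#include <string.h>

/* Insert 'x' into the sorted array 'a' of length 'n' (capacity > n),
   skipping duplicates. Returns the new length. The search is O(log n);
   the memmove that opens the gap dominates at O(n). */
static size_t sorted_insert(int *a, size_t n, int x) {
    size_t lo = 0, hi = n;
    while (lo < hi) {                 /* binary search for insert position */
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < x)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < n && a[lo] == x)         /* duplicates are not allowed */
        return n;
    memmove(&a[lo + 1], &a[lo], (n - lo) * sizeof *a);  /* open the gap */
    a[lo] = x;
    return n + 1;
}
```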
If you really have an array and not a better data structure, that's optimal. If you're flexible on the implementation, take a look at AA trees: they're rather fast and easy to implement. Obviously they take more space than an array, and it's not worth it if the number of elements isn't big enough for the slowness of the blit to be noticeable compared to pointer magic.
Does the data have to be sorted completely all the time?
If not, and it is only necessary to access the smallest or highest element quickly, a binary heap gives constant access time and log n addition and deletion time.
Moreover, it can satisfy your condition that the memory be consecutive, since you can implement a binary heap on top of an array (i.e., array[2n+1] is the left child and array[2n+2] the right child of the node at index n).
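In C, that layout boils down to a little index arithmetic (a sketch using the 0-based convention from this answer):

```c
#include <stddef.h>

/* 0-based binary heap layout on an array, as described above:
   children of node i live at 2i+1 and 2i+2, its parent at (i-1)/2. */
static inline size_t heap_left(size_t i)   { return 2 * i + 1; }
static inline size_t heap_right(size_t i)  { return 2 * i + 2; }
static inline size_t heap_parent(size_t i) { return (i - 1) / 2; } /* i > 0 */
```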
A heap-based implementation of a tree would be more efficient if you are inserting a lot of elements: log n for both locating/removing and inserting operations.
This is an assignment question that I am having trouble wording an answer to.
"Suppose a tree may have up to k children per node. Let v be the average number of children per node. For what value(s) of v is it more efficient (in terms of space used) to store the child nodes in a linked list versus storage in an array? Why?"
I believe I can answer the "why?" more or less in plain English: it will be more efficient to use the linked list because, rather than having a bunch of empty slots (i.e., empty indexes in the array when your average is lower than the max) taking up memory, you only allocate space for a node in a linked list when you're actually filling in a value.
So if you've got an average of 6 children when your maximum is 200, the array will reserve space for all 200 children of each node when the tree is created, but the linked list will only allocate space for nodes as needed. So, with the linked list, space used will be roughly proportional to the average; with the array, space used will be proportional to the max.
...I don't see when it would ever be more efficient to use the array. Is this a trick question? Do I have to take into account the fact that the array needs to have a limit on total number of nodes when it's created?
For many commonly used languages, the array will require allocating storage for k memory addresses (of the data). A singly-linked list will require 2 addresses per node (data and next). A doubly-linked list would require 3 addresses per node.
Let n be the actual number of children of a particular node A:
The array uses k memory addresses
The singly-linked list uses 2n addresses
The doubly-linked list uses 3n addresses
Comparing k with 2n or 3n tells you whether the list is a gain or a loss: the singly-linked list saves space when the average v < k/2, and the doubly-linked list when v < k/3. For example, with k = 200, a singly-linked list wins whenever nodes average fewer than 100 children.
...I don't see when it would ever be more efficient to use the array. Is this a trick question?
It’s not a trick question. Think of the memory overhead that a linked list has. How is a linked list implemented (vs. an array)?
Also (though this is beyond the scope of the question!), space consumption isn’t the only deciding factor in practice. Caching plays an important role in modern CPUs and storing the individual child nodes in an array instead of a linked list can improve the cache locality (and consequently the tree’s performance) drastically.
Arrays must pre-allocate space, but that lets us access any entry very fast.
Lists allocate memory whenever they create a new node, and that isn't ideal, because memory allocation costs CPU time.
Tip: you can allocate the whole array at once if you want to, but usually we allocate, let's say, 4 entries and resize by doubling the size when we need more space.
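A minimal sketch of that doubling strategy in C (the struct and names are illustrative):

```c
#include <stdlib.h>

struct dynarray {
    int   *data;
    size_t len;
    size_t cap;
};

/* Doubling on overflow gives amortized O(1) appends, at the cost of
   an occasional full copy when realloc has to move the block. */
static int da_push(struct dynarray *da, int v) {
    if (da->len == da->cap) {
        size_t ncap = da->cap ? da->cap * 2 : 4;  /* start with 4 entries */
        int *t = realloc(da->data, ncap * sizeof *t);
        if (!t) return -1;
        da->data = t;
        da->cap  = ncap;
    }
    da->data[da->len++] = v;
    return 0;
}
```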
I can imagine it could be a very good idea, and very efficient in many scenarios, to use a LinkedList if the data item one uses has intrinsic logic for a previous and next item anyway, for example a TimeTableItem or anything that is somehow time-related. Such items should implement an interface, so the LinkedList implementation can leverage that and doesn't have to wrap the items in its own node objects. Inserting and removing would be much more efficient here than with a List implementation that internally juggles arrays around.
You're assuming the array can't be dynamically re-allocated. If it can, the array wins, because it need be no bigger than k items (plus a constant overhead), and because it doesn't need per-item storage for pointers.
When implementing a heap structure, we can store the data in an array such that the children of the node at position i are at positions 2i and 2i+1.
My question is: why don't we use an array to represent binary search trees too, instead of dealing with pointers etc.?
thanks
Personally:
Because using pointers it's easier to grow the data structure size dynamically.
I find it's easier to maintain a binary tree than a heap.
The algorithms to balance, remove, and insert elements in the tree alter only pointers and do not physically move the elements as in a vector.
And so on...
If the position of all children is statically precomputed like that, then the array essentially represents a completely full, completely balanced binary tree.
Not all binary trees in "real life" are completely full and perfectly balanced. If you happen to have a few especially long branches, you'd have to make your whole array a lot larger to accommodate all nodes at the bottom-most level.
If an array-bound binary tree is mostly empty, most of the array space is wasted.
If only some of the tree's branches are deep enough to reach to the "bottom" of the array, there's also a lot of space being wasted.
If the tree (or just one branch) needs to grow "deeper" than the size of the array will allow, this would require "growing" the array, which is usually implemented as copying to a larger array. This is a time-expensive operation.
So: Using pointers allows us to grow the structure dynamically and flexibly. Representing a tree in an array is a nice academic exercise and works well for small and simple cases but often does not fulfill the demands of "real" computing.
Mainly because the recursive tree allows for very simple code. If you flatten the tree into an array, the code becomes really complex because you have to do a lot of bookkeeping which the recursive algorithm does for you.
Also, a tree of height N can have anything between N+1 and 2^(N+1)-1 nodes. Only the actual nodes need memory. If you use an array, you must always allocate space for all possible nodes (even the empty ones) unless you use a sparse array (which would make the code even more complex). So while it is easy to keep a sparse tree of height 100 in memory, it would be problematic to find a computer that can allocate 20282409603651670423947251286008 bytes of RAM, which is what an array of 2^101 - 1 cells of 8 bytes each would need.
To insert an element into a heap, you can place it anywhere and swap it with its parent until the heap constraint is valid again. Swap-with-parent is an operation that keeps the binary tree structure of the heap intact. This means a heap of size N will be represented as an N-cell array, and you can add a new element in logarithmic time.
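In C, with a 0-based array layout (children at 2i+1 and 2i+2), that insertion looks roughly like this (a sketch for a min-heap, assuming the array has a free cell at index n):

```c
#include <stddef.h>

/* Min-heap insertion as described above: place the new value in the
   first free cell (index n), then swap it with its parent until the
   heap constraint is valid again. O(log n), and the array stays dense,
   one cell per element. */
static void heap_push(int *a, size_t n, int v) {
    size_t i = n;
    a[i] = v;                       /* place anywhere: the first free cell */
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (a[parent] <= a[i])
            break;                  /* heap constraint holds again */
        int tmp = a[parent];        /* swap with parent */
        a[parent] = a[i];
        a[i] = tmp;
        i = parent;
    }
}
```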
A binary search tree can be represented as an array of size N using the same representation structure as a heap (children at 2n and 2n+1), but inserting an element this way is a lot harder, because unlike the heap constraint, the binary search tree constraint requires rotations to be performed to get back a balanced tree. So either you manage to keep an N-node tree in an N-cell array at a cost higher than logarithmic, or you waste space by keeping the tree in a larger array (if my memory serves, a red-black tree could waste as much as 50% of your array).
So a binary search tree in an array is only interesting if the data inside is constant. And if it is, then you don't need the heap structure (children at 2n and 2n+1): you can just sort your array and use binary search.
As far as I know, we can use an array to represent binary search trees.
But it is more flexible to use pointers.
The array-based implementation is useful if you need a heap that is used as a priority queue in graph algorithms. In that case the set of elements stays roughly constant: you pop the topmost element and insert new ones. Removing the top (or min) element requires some re-balancing to make it a heap again, which can be done such that the array stays reasonably balanced.
A reference for this is the algorithm by Goldberg and Tarjan for efficiently computing optimal network flow in directed graphs, iirc.
A heap is a complete binary tree, unlike a BST. Hence, using arrays is not of much use for BSTs.