I need to frequently add and remove elements to and from a large number of sorted lists.
The lists are bounded in size and relatively small -- around 100 elements or fewer, with each element on the order of 32 bytes.
If inserting into a full list, the last element can be discarded.
I've done some simple profiling, and found that storing these lists in arrays and using memmove to move everything after the insertion point backwards works surprisingly well; better even than using a linked list of a fixed size and linking the tail into the insertion point. (Never underestimate the power of spatial locality, I guess.)
But I think I can do even better. I suspect that most of my operations will be near the top of the lists; that means I'm going to memmove the vast majority of the list every time I insert or remove. So what if I were to implement this as a sort of ring buffer, and if an operation is closer to the top than the bottom, I shift the topmost items backwards, with the old tail getting overwritten by the head? This should theoretically involve cheaper calls to memmove.
But I'm completely brain farting on implementing this elegantly. The list can now wrap around, with the head being at position k and the tail being at position (k-1)%n if the list is full. So there's the possibility of doing up to three operations (k is the head, m is the insertion point, n is the max list size):
memmove elements k through n-1 back one
memcpy element 0 to location n-1
memmove elements 1 through m-1 back one
I don't know if that'll be faster than one bigger memmove, but we'll see.
Anyway, I just have a gut feeling that there's a very clever and clean way to implement this through modulo arithmetic and ternary operators. But I can't think of a way to do it without multiple nested "if" statements covering:
Whether we're inserting or removing.
Whether we're closer to the front or the back.
Whether the list "wraps around" the end of the array.
Whether the item being inserted falls in the same segment as the head, or in the wrapped-around portion.
Whether the head is at element 0.
I'm sure that too much branching will doom any improvement I get from the smaller memmoves. Is there a clean solution here that I'm just not seeing?
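For concreteness, here's roughly what I've got so far, with the branches I'd like to avoid spelled out. The names (elem_t, ring_t, ring_insert_front_shift) are just placeholders, and this only covers the shift-toward-the-head direction:

#include <string.h>
#include <stddef.h>

/* Hypothetical element type: ~32 bytes, per the sizes above. */
typedef struct { char payload[32]; } elem_t;

typedef struct {
    elem_t *buf;   /* backing storage, capacity cap       */
    size_t  cap;   /* maximum number of elements          */
    size_t  head;  /* physical index of logical element 0 */
    size_t  len;   /* current number of elements (<= cap) */
} ring_t;

/* Physical slot of logical index i. */
static size_t phys(const ring_t *r, size_t i) {
    return (r->head + i) % r->cap;
}

/*
 * Insert *val at logical position m (0 <= m <= len, m < cap) by shifting
 * the elements *before* m one slot toward the head -- the cheap direction
 * when m is small.  If the list is full, the old tail gets overwritten,
 * which matches "discard the last element on insert into a full list".
 */
static void ring_insert_front_shift(ring_t *r, size_t m, const elem_t *val) {
    size_t h = r->head;
    size_t newh = (h + r->cap - 1) % r->cap;   /* head moves back one */

    if (m > 0) {
        if (h == 0) {
            /* Front segment starts at slot 0: slot 0 wraps to cap-1,
             * the rest slide left by one. */
            memcpy(&r->buf[r->cap - 1], &r->buf[0], sizeof(elem_t));
            memmove(&r->buf[0], &r->buf[1], (m - 1) * sizeof(elem_t));
        } else if (h + m <= r->cap) {
            /* Front segment is contiguous: one memmove of m elements. */
            memmove(&r->buf[h - 1], &r->buf[h], m * sizeof(elem_t));
        } else {
            /* Front segment wraps: the three operations listed above. */
            size_t first = r->cap - h;              /* slots h..cap-1 */
            memmove(&r->buf[h - 1], &r->buf[h], first * sizeof(elem_t));
            memcpy(&r->buf[r->cap - 1], &r->buf[0], sizeof(elem_t));
            memmove(&r->buf[0], &r->buf[1], (m - first - 1) * sizeof(elem_t));
        }
    }

    r->head = newh;
    r->buf[phys(r, m)] = *val;   /* new element lands at logical m */
    if (r->len < r->cap)
        r->len++;                /* full list: tail silently discarded */
}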
Related
In C, which is more efficient in terms of memory management, a linked list or an array?
For my program, I could use one or both of them. I would like to take this point into consideration before starting.
Both linked lists and arrays have good and bad sides.
Array
Accessing a particular position takes O(1) time, because an array's memory is contiguous: if the address of the first position is A, then the address of the 5th element is A + 4 * sizeof(element).
If you want to insert a number at some position, it takes O(n) time, because you have to shift every element after that position and possibly grow the array.
As for searching: if the array is sorted, you can do a binary search, and since each probe is an O(1) access, the whole search takes O(log n) (see the sketch after this list). If the array is not sorted, you have to traverse the entire array, so O(n).
Deletion is the exact opposite of insertion: you have to shift left all the numbers after the position you deleted from, and you might also want to reallocate the array for memory efficiency. So O(n).
Memory must be contiguous, which can be a problem on old x86 machines with 64k segments.
Freeing is a single operation.
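As a rough illustration of the binary search mentioned above (just a sketch over an int array; adapt the element type as needed):

#include <stddef.h>

/* Return the index of `key` in the sorted array `a` of length `n`,
 * or -1 if it is not present.  Each probe is an O(1) array access,
 * so the whole search is O(log n). */
static ptrdiff_t binary_search(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;          /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else if (a[mid] > key)
            hi = mid;
        else
            return (ptrdiff_t)mid;  /* found */
    }
    return -1;                      /* not found */
}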
Linked list
Accessing a particular position takes O(n) time, because you have to traverse the list from the head to reach that position.
If you want to insert a number at some position and you already have a pointer to that position, it takes O(1) time to insert the new value.
As for searching: no matter how the numbers are arranged, you have to traverse them from front to back one by one to find the one you want, so it is always O(n).
Deletion is the exact opposite of insertion: if you already know the position via a pointer -- say the list is p->q->r and you want to delete q -- all you need to do is set the next of p to r, and nothing else. So O(1), given you have a pointer to p (see the sketch below).
Memory is dispersed. With a naive implementation, that can be bad for cache locality, and the overall cost can be high because the memory allocation system has overhead for each node. However, careful programming can get around this.
Freeing requires a separate call for each node, though again careful programming can get around this.
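A minimal sketch of that O(1) unlink, assuming a simple singly linked node type (names made up for illustration):

#include <stdlib.h>

struct node {
    int          value;
    struct node *next;
};

/* Given a pointer to p in the list p -> q -> r, unlink and free q.
 * Only a couple of pointer assignments: O(1), no shifting of data. */
static void delete_after(struct node *p) {
    struct node *q = p->next;
    if (q == NULL)
        return;                 /* nothing to delete */
    p->next = q->next;          /* p now points at r */
    free(q);
}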
So depending on what kind of problem you are solving you have to choose one of the two.
A linked list uses more memory, both for the links themselves and inside the memory manager, because you are allocating many individual blocks of memory.
That does not mean it is less efficient at all, depending on what you are doing.
While a linked list uses more memory, adding or removing elements from it is very efficient, as it doesn't require moving data around at all, whereas resizing a dynamic array means allocating a whole new area of memory to fit the modified array with its items added or removed. You can also sort a linked list without moving its data.
On the other hand, arrays can be substantially faster to iterate thanks to caching and prefetching, as the data is placed sequentially in memory.
Which one is better for you will really depend on the application.
So, I've implemented a binary search tree backed by an array. The full implementation is here.
Because the tree is backed by an array, I determine left and right children by performing arithmetic on the current index.
private Integer getLeftIdx(Integer rootIndex) {
return 2 * rootIndex + 1;
}
private Integer getRightIdx(Integer rootIndex) {
return 2 * rootIndex + 2;
}
I've realized that this can become really inefficient as the tree becomes unbalanced, partly because the array will be sparsely populated, and partly because the tree height will increase, causing searches to tend towards O(n).
I'm looking at ways to rebalance the tree, but I keep coming across algorithms like Day-Stout-Warren which seem to rely on a linked-list implementation for the tree.
Is this just the tradeoff for an array implementation? I can't seem to think of a way to rebalance without creating a second array.
Imagine you have an array of length M that contains N items (with N < M, of course) at various positions, and you want to redistribute them into "valid positions" without changing their order.
To do that you can first walk through the array from end to start, packing all the items together at the end, and then walk through the array from start to end, moving an item into each valid position you find until you run out of items.
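Here is a rough sketch of that two-pass compaction for the plain index-order case, assuming a sentinel value marks empty slots (EMPTY and compact are made-up names); the tree version walks the indices in in-order instead of 0..M-1, but the shape is the same:

#include <stddef.h>

#define EMPTY (-1)   /* sentinel for an unoccupied slot (assumption) */

/* The "easy" version: a[] has length m and contains n occupied slots
 * (marked != EMPTY) at arbitrary positions.  Pack the items into
 * slots 0..n-1 without changing their relative order. */
static void compact(int *a, size_t m, size_t n) {
    size_t w = m;

    /* Pass 1: walk from the end to the start, packing items at the end. */
    for (size_t r = m; r-- > 0; ) {
        if (a[r] != EMPTY) {
            a[--w] = a[r];
            if (w != r)
                a[r] = EMPTY;
        }
    }

    /* Pass 2: walk from the start, filling each valid slot (index < n)
     * with the next packed item. */
    for (size_t i = 0; i < n; i++) {
        a[i] = a[w];
        if (w != i)
            a[w] = EMPTY;
        w++;
    }
}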
This easy problem is the same as your problem, except that you don't want to walk through the array in "index order", you want to walk through it in binary in-order traversal order.
You want to move all the items into "valid positions", i.e. the part of the array corresponding to indexes < N, and you don't want to change their in-order traversal order.
So, walk the array in reverse in-order, packing items into the in-order-last possible positions. Then walk forward over the items in order, putting each item into the in-order-first available valid position until you run out of items.
BUT NOTE: This is fun to consider, but it's not going to make your tree efficient for inserts -- you have to do too many rebalancings to keep the array at a reasonable size.
BUT BUT NOTE: You don't actually have to rebalance the whole tree. When there's no free place for the insert, you only have to rebalance the smallest subtree on the path that has an extra space. I vaguely remember a result that I think applies, which suggests that the amortized cost of an insert using this method is O(log^2 N) when your array has a fixed number of extra levels. I'll do the math and figure out the real cost when I have time.
I keep coming across algorithms like Day-Stout-Warren which seem to rely on a linked-list implementation for the tree.
That is not quite correct. The original paper discusses the case where the tree is embedded into an array. In fact, section 3 is devoted to the changes necessary. It shows how to do so with constant auxiliary space.
Note that there's a difference between their implementation and yours, though.
Your idea is to use a binary-heap order, where once you know a single-number index i, you can determine the indices of the children (or the parent). The array is, in general, not sorted in increasing indices.
The idea in the paper is to use an array sorted in increasing indices, and to compact the elements toward the beginning of the array on a rebalance. Using this implementation, you would not specify an element by an index i. Instead, as in binary search, you would indirectly specify an element by a pair (b, e), where the idea is that the index is implicitly specified as ⌊(b + e) / 2⌋, but the information allows you to determine how to go left or right.
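As a small illustrative sketch (hypothetical names, plain int elements), ordinary binary search can be phrased exactly this way, with (b, e) as the pair and ⌊(b + e) / 2⌋ as the implied index:

#include <stddef.h>

/* An element of the sorted, compacted array is identified by the
 * half-open range (b, e); its index is (b + e) / 2.  Going left keeps
 * the lower half, going right keeps the upper half. */
typedef struct { size_t b, e; } range_t;

static size_t index_of(range_t r)  { return (r.b + r.e) / 2; }
static range_t go_left(range_t r)  { return (range_t){ r.b, index_of(r) }; }
static range_t go_right(range_t r) { return (range_t){ index_of(r) + 1, r.e }; }

/* Standard binary search expressed with these helpers. */
static ptrdiff_t find(const int *a, size_t n, int key) {
    range_t r = { 0, n };
    while (r.b < r.e) {
        size_t i = index_of(r);
        if (a[i] == key)      return (ptrdiff_t)i;
        else if (a[i] < key)  r = go_right(r);
        else                  r = go_left(r);
    }
    return -1;   /* not found */
}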
I'm trying to find the point of a singly linked list where a loop begins.
What I thought of was taking two pointers, *slow and *fast, one moving at twice the speed of the other.
If the list has a loop, then at some point slow == fast. For example (the tail is 1-2-3-4 and the loop is 5-6-7-8):

1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8
                    ^              |
                    +--------------+
Can there be another elegant solution so that the list is traversed only once?
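For reference, this is roughly what the slow/fast idea looks like when written out, including the usual second pass that pins down the first node of the loop (the node struct is just an example):

#include <stddef.h>

struct node {
    int          value;
    struct node *next;
};

/* Two-pointer (slow/fast) approach.  Returns the node where the loop
 * begins, or NULL if there is no loop.  Phase 1 finds a meeting point
 * inside the loop; phase 2 restarts one pointer at the head -- both
 * then advance one step at a time and meet at the start of the loop. */
static struct node *find_loop_start(struct node *head) {
    struct node *slow = head, *fast = head;

    /* Phase 1: advance slow by 1 and fast by 2 until they meet. */
    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) {
            /* Phase 2: move one pointer back to the head; advancing
             * both by 1 makes them meet at the loop's first node. */
            slow = head;
            while (slow != fast) {
                slow = slow->next;
                fast = fast->next;
            }
            return slow;
        }
    }
    return NULL;   /* fast hit the end: the list has no loop */
}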
Your idea of using two walkers, one at twice the speed of the other, would work; however, the more fundamental question this raises is whether you are picking an appropriate data structure. You should ask yourself if you really need to find the midpoint, and if so, what other structures might be better suited to achieve this in O(1) (constant) time. An array would certainly give you much better performance for the midpoint of a collection, but has other operations which are slower. Without knowing the rest of the context I can't make any other suggestion, but I would suggest reviewing your requirements.
I am assuming this was some kind of interview question.
If your list has a loop, then to do it in a single traversal, you will need to mark the nodes as visited as your fast walker goes through the list. When the fast walker encounters NULL or an already visited node, the iteration can end, and your slow walker is at the midpoint.
There are many ways to mark the node as visited, but an external map or set could be used. If you mark the node directly in the node itself, this would necessitate another traversal to clean up the mark.
Edit: So this is not about finding the midpoint, but about loop detection without revisiting already visited nodes. Marking works for that as well. Just traverse the list and mark the nodes. If you hit NULL, no loop. If you hit a visited node, there is a loop. If the mark includes a counter as well, you even know where the loop starts.
I'm assuming that this singly linked list ends with NULL. In that case, the slow and fast pointers will work: because the fast pointer moves at double the speed of the slow one, when the fast pointer reaches the end of the list, the slow pointer will be at the middle of it.
I am very puzzled about this. Everywhere there is written "linked lists are faster than arrays" but no one makes the effort to say WHY. Using plain logic I can't understand how a linked list can be faster. In an array all cells are next to each other so as long as you know the size of each cell it's easy to reach one cell instantly. For example if there is a list of 10 integers and I want to get the value in the fourth cell then I just go directly to the start of the array+24 bytes and read 8 bytes from there.
On the other hand, when you have a linked list and you want to get the element in the fourth place, you have to start from the beginning or the end of the list (depending on whether it's a singly or doubly linked list) and go from one node to the next until you find what you're looking for.
So how the heck can going step by step be faster than going directly to an element?
This question title is misleading.
It asserts that linked lists are faster than arrays without limiting the scope well. There are a number of times when arrays can be significantly faster and there are a number of times when a linked list can be significantly faster: the particular case of linked lists "being faster" does not appear to be supported.
There are two things to consider:
The theoretical bounds of linked-lists vs. arrays in a particular operation; and
the real-world implementation and usage pattern including cache-locality and allocations.
As far as access of an indexed element goes: the operation is O(1) in an array and, as pointed out, is very fast (just an offset). The operation is O(k) in a linked list (where k is the index, which may always be << n, depending), but if the linked list is already being traversed then each step is O(1), which is "the same" as an array. Whether an array traversal (for (i = 0; i < len; i++)) is faster or slower depends on the particular implementation/language/run-time.
However, if there is a specific case where the array is not faster for either of the above operations (seek or traversal), it would be interesting to see it dissected in more detail. (I am sure it is possible to find a language with a very degenerate implementation of arrays relative to lists -- cough Haskell cough.)
Happy coding.
My simple usage summary: Arrays are good for indexed access and operations which involve swapping elements. The non-amortized re-size operation and extra slack (if required), however, may be rather costly. Linked lists amortize the re-sizing (and trade slack for a "pointer" per-cell) and can often excel at operations like "chopping out or inserting a bunch of elements". In the end they are different data-structures and should be treated as such.
Like most problems in programming, context is everything. You need to think about the expected access patterns of your data, and then design your storage system appropriately. If you insert something once, and then access it 1,000,000 times, then who cares what the insert cost is? On the other hand, if you insert/delete as often as you read, then those costs drive the decision.
Depends on which operation you are referring to. Adding or removing elements is a lot faster in a linked list than in an array.
Iterating sequentially over the list one by one is more or less the same speed in a linked list and an array.
Getting one specific element in the middle is a lot faster in an array.
And the array might waste space, because very often when expanding the array, more elements are allocated than needed at that point in time (think ArrayList in Java).
So you need to choose your data structure depending on what you want to do:
many insertions and iterating sequentially --> use a LinkedList
random access and ideally a predefined size --> use an array
Because, unlike with an array, no memory has to be moved when an insertion is made in the middle of a linked list.
For the case you presented, it's true that arrays are faster: you only need arithmetic to go from one element to another, whereas a linked list requires indirection and fragments memory.
The key is to know what structure to use and when.
Linked lists are preferable over arrays when:
a) you need constant-time insertions/deletions from the list (such as in real-time computing where time predictability is absolutely critical)
b) you don't know how many items will be in the list. With arrays, you may need to re-declare and copy memory if the array grows too big
c) you don't need random access to any elements
d) you want to be able to insert items in the middle of the list (such as a priority queue)
Arrays are preferable when:
a) you need indexed/random access to elements
b) you know the number of elements in the array ahead of time so that you can allocate the correct amount of memory for the array
c) you need speed when iterating through all the elements in sequence. You can use pointer math on the array to access each element, whereas with a linked list you need to look up each node through a pointer, which may cause page faults and hence performance hits (see the sketch after this list).
d) memory is a concern. Filled arrays take up less memory than linked lists. Each element in the array is just the data. Each linked list node requires the data as well as one (or more) pointers to the other elements in the linked list.
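A small, purely illustrative sketch of the difference in (c): hypothetical sum functions, one walking an array with pointer arithmetic and one chasing node pointers:

#include <stddef.h>

struct node {
    int          value;
    struct node *next;
};

/* Array iteration: the next element is just "one stride further",
 * so the cache and hardware prefetcher work in your favour. */
static long sum_array(const int *a, size_t n) {
    long sum = 0;
    for (const int *p = a; p != a + n; ++p)
        sum += *p;
    return sum;
}

/* Linked-list iteration: each step chases a pointer to a node that
 * may live anywhere in memory, which can miss the cache (or page). */
static long sum_list(const struct node *head) {
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}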
Array Lists (like those in .Net) give you the benefits of arrays, but dynamically allocate resources for you so that you don't need to worry too much about list size and you can delete items at any index without any effort or re-shuffling elements around. Performance-wise, arraylists are slower than raw arrays.
Reference: Lamar's answer, https://stackoverflow.com/a/393578/6249148
LinkedList is node-based, meaning the data can be placed anywhere in memory and is linked together by nodes (objects that point to one another rather than sitting next to one another).
Array is a set of similar data objects stored in sequential memory locations
The advantage of a linked list is that the data doesn't have to be sequential in memory. When you add/remove an element, you are simply changing a node's pointer to point to a different node, not actually moving elements around. If you don't have to add elements toward the end of the list, then accessing data is faster, since you iterate over fewer elements. There are also variations on the LinkedList, such as a DoublyLinkedList, whose nodes point to both the previous and next node.
The advantage of an array is that you can access any element in O(1) time if you know the index, but if you don't know the index, then you will have to iterate over the data.
The downside of an array is that its data is stored sequentially in memory. If you want to insert an element at index 1, you have to move every single element to the right. Also, the array has to keep resizing itself as it grows, essentially copying itself into a new array with a larger capacity. If you want to remove an element at the beginning, you will have to move all the elements to the left.
Arrays are good when you know the index, but are costly as they grow.
The reason why people talk highly about linked lists is because the most useful and efficient data structures are node based.
I would like to write a piece of code for inserting a number into a sorted array at the appropriate position (i.e. the array should still remain sorted after insertion)
My data structure doesn't allow duplicates.
I am planning to do something like this:
Find the right index where I should be putting this element using binary search
Create space for this element, by moving all the elements from that index down.
Put this element there.
Is there any other better way?
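For concreteness, here's roughly what I mean, as a hypothetical sorted_insert over an int array (adapt the types as needed):

#include <string.h>
#include <stddef.h>
#include <stdbool.h>

/* The three steps described above: binary-search for the insertion
 * point, shift the tail right by one, then store the value.
 * Duplicates are rejected.  Returns false if the value was already
 * present or the array (capacity `cap`) is full.  `len` is updated. */
static bool sorted_insert(int *a, size_t *len, size_t cap, int value) {
    size_t lo = 0, hi = *len;

    /* 1. Find the leftmost index whose element is >= value. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < value)
            lo = mid + 1;
        else
            hi = mid;
    }

    if (lo < *len && a[lo] == value)
        return false;               /* no duplicates allowed */
    if (*len == cap)
        return false;               /* no room */

    /* 2. Make space by shifting everything from `lo` one slot right. */
    memmove(&a[lo + 1], &a[lo], (*len - lo) * sizeof a[0]);

    /* 3. Put the element there. */
    a[lo] = value;
    (*len)++;
    return true;
}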
If you really have an array and not a better data structure, that's optimal. If you're flexible on the implementation, take a look at AA Trees - They're rather fast and easy to implement. Obviously, takes more space than array, and it's not worth it if the number of elements is not big enough to notice the slowness of the blit as compared to pointer magic.
Does the data have to be sorted completely all the time?
If it does not -- if you only need quick access to the smallest (or largest) element -- a binary heap gives constant access time and O(log n) addition and deletion time.
Moreover, it satisfies your condition that the memory be consecutive, since you can implement a binary heap on top of an array (i.e., array[2n+1] is the left child and array[2n+2] the right child of array[n]); see the sketch below.
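A small sketch of what insertion into such an array-backed heap might look like (a hypothetical min-heap; deleting the root is similar but sifts down):

#include <stddef.h>

/* Minimal min-heap sketch on top of a plain array, using the indexing
 * from above: children of node i live at 2*i + 1 and 2*i + 2, so the
 * parent of node i is at (i - 1) / 2.  Insertion appends at the end
 * and "sifts up", which is O(log n).  heap[] must have room. */
static void heap_insert(int *heap, size_t *len, int value) {
    size_t i = (*len)++;
    heap[i] = value;

    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (heap[parent] <= heap[i])
            break;                      /* heap property restored */
        int tmp = heap[parent];         /* swap child and parent */
        heap[parent] = heap[i];
        heap[i] = tmp;
        i = parent;
    }
}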
A heap based implementation of a tree would be more efficient if you are inserting a lot of elements - log n for both locating/removing and inserting operations.