Linked list insertion/deletion efficiency vs. arrays

Traditionally, linked lists are recommended over arrays when we want to perform insertions/deletions at arbitrary locations. This is because with a linked list we just have to update the pointers of the adjacent nodes (in a singly linked list, only the next pointer of the preceding node), whereas in an array we have to shift numerous elements over to make space for the new element (in the case of insertion).
However, the process of finding the location of the insertion/deletion in a linked list is very costly (sequential search) compared to an array (random access), especially when we have a lot of data.
Does this factor significantly decrease the efficiency of insertion/deletion in a linked list compared to an array? Or is the time required to shift the elements in an array a bigger problem than the sequential access?

However, the process of finding the location of the insertion/deletion in a linked list is very costly (sequential search) compared to an array (random access), especially when we have a lot of data.
Random access doesn't help if you are searching for an element and don't know where it is; and if you do know where it is and have a pointer or index to it, there's no longer a search involved in accessing the element, whether you're using a linked list or an array. The only case where random access helps here is if the array is sorted, in which case it enables a binary search.
Does this factor significantly decrease the efficiency of insertion/deletion in a linked list compared to an array? Or is the time required to shift the elements in an array a bigger problem than the sequential access?
Generally not, at least with unordered sequences, since, again, both arrays and linked lists require a linear-time search to find an element. And if people need to search frequently in their critical paths over non-trivial input sizes, they often use hash tables, balanced binary trees, tries, or something of that sort instead.
Often arrays are preferred over linked lists in a lot of performance-critical fields for reasons that don't relate to algorithmic complexity. It's because arrays are guaranteed to contiguously store their elements. That provides very good locality of reference for sequential processing.
There are also ways to remove from arrays in constant-time. As one example, if you want to remove the nth element from an array in constant-time, just swap it with the last element in the array and remove the last one in constant-time. You don't necessarily have to shuffle all the elements over to close a gap if you're allowed to reorder the elements.
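To make that concrete, here's a minimal C++ sketch of that swap-and-pop removal (the helper name is just for illustration):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Remove the element at index i in O(1) by swapping it with the last
// element and popping the back. Note that the relative order of the
// remaining elements is not preserved.
template <typename T>
void swap_remove(std::vector<T>& v, std::size_t i)
{
    std::swap(v[i], v.back());
    v.pop_back();
}
```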
Linked lists may or may not store their nodes contiguously. They often become a lot more useful in performance-critical contexts if they do, like if they store their nodes in an array (either through an array-based container or allocator). Otherwise traversing them can lead to cache misses galore with potentially a cache miss for every single node being accessed.

the process of finding the location of the insertion/deletion in a linked list is very costly (sequential search) compared to an array (random access)
That comparison is the wrong one to make when judging the efficiency of the insertion/deletion operations themselves. Instead, compare these two factors:
Sequential search in a linked list, which has time complexity O(n).
Shifting array elements to make room (or close a gap), which may require copying up to n array elements.
In an array: if the underlying type is POD, the elements can be moved with a raw memory copy, but if not, each one must be moved with the object's operator=.
So you can see that not everything is in favor of the array. A linked list obviates the need to copy the same data again and again.
, especially when we have a lot of data.
That just means more array elements have to be copied when shifting.

Related

Why is Merge sort better for large arrays and Quick sort for small ones?

The only reason I see for using merge sort over quick sort is if the list was already (or mostly) sorted.
Merge sort requires more space as it creates an extra array for storing, and no matter what it will compare every item.
Quick sort on the other hand does not require extra space, and doesn't swap or compare more than necessary.
It would seem unintuitive to say that because of large or small data sets one is better than the other.
For example, quoting the GeeksforGeeks article on the topic:
Merge sort can work well on any type of data sets irrespective of its size (either large or small).
whereas
The quick sort cannot work well with large datasets.
And next it says:
Merge sort is not in place because it requires additional memory space to store the auxiliary arrays.
whereas
The quick sort is in place as it doesn’t require any additional storage.
I understand that space complexity and time complexity are separate things. But it is still an extra step, and of course writing everything into a new array would take more time with large data sets.
As for the pivoting problem, the bigger the data set, the lower the chance of picking the lowest or highest item (unless, again, it's an almost sorted list).
So why is it considered that merge sort is a better way of sorting large data sets instead of quick sort?
Why is Merge sort better for large arrays and Quick sort for small ones?
It would seem unintuitive to say that because of large or small data sets one is better than the other.
Assuming the dataset fits in memory (is not paged out), the issue is not the size of the dataset but a worst-case pattern for a particular implementation of quick sort that results in O(n²) time complexity. Quick sort can use median of medians to guarantee a worst-case time complexity of O(n log(n)), but that ends up making it significantly slower than merge sort. An alternative is to switch to heap sort if the level of recursion becomes too deep; this is known as introsort and is used in some libraries.
https://en.wikipedia.org/wiki/Median_of_medians
https://en.wikipedia.org/wiki/Introsort
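To illustrate the idea (not the exact code any library uses), here's a simplified C++ sketch of an introsort: plain quicksort with a depth counter that falls back to heapsort when it runs out. Real library versions also switch to insertion sort for very small ranges.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using Iter = std::vector<int>::iterator;

// Quicksort until the recursion depth limit is hit, then fall back to
// heapsort so the worst case stays O(n log n).
void introsort_impl(Iter first, Iter last, int depth)
{
    if (last - first < 2) return;
    if (depth == 0) {                           // recursion too deep: heapsort
        std::make_heap(first, last);
        std::sort_heap(first, last);
        return;
    }
    int pivot = *(first + (last - first) / 2);
    // Partition into "< pivot" and ">= pivot". A consistently bad pivot just
    // burns depth until the heapsort fallback takes over.
    Iter split = std::partition(first, last,
                                [pivot](int x) { return x < pivot; });
    introsort_impl(first, split, depth - 1);
    introsort_impl(split, last, depth - 1);
}

void introsort(std::vector<int>& a)
{
    int depth = a.empty() ? 0 : 2 * static_cast<int>(std::log2(a.size()));
    introsort_impl(a.begin(), a.end(), depth);
}
```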
Merge sort requires more space as it creates an extra array for storing, and no matter what it will compare every item.
There are variations of merge sort that don't require any extra storage for data, but they tend to be about 50+% slower than standard merge sort.
Quick sort on the other hand does not require extra space, and doesn't swap or compare more than necessary.
Every element of a sub-array will be compared to the pivot element. As the number of equal elements increases, Lomuto partition scheme gets worse, while Hoare partition scheme gets better. With a lot of equal elements, Hoare partition scheme needlessly swaps equal elements, but the check to avoid the swaps generally costs more time than just swapping.
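For reference, here are hand-written sketches of the two partition schemes being compared; they're illustrative, not taken from any particular library:

```cpp
#include <utility>
#include <vector>

// Lomuto partition: pivot is the last element; returns its final index.
// Elements equal to the pivot all end up on one side, so with many
// duplicates the split becomes very lopsided.
int lomuto_partition(std::vector<int>& a, int lo, int hi)   // [lo, hi] inclusive
{
    int pivot = a[hi];
    int i = lo;
    for (int j = lo; j < hi; ++j)
        if (a[j] < pivot) std::swap(a[i++], a[j]);
    std::swap(a[i], a[hi]);
    return i;
}

// Hoare partition: scans from both ends; returns j such that [lo, j] and
// [j+1, hi] are the two halves. Equal elements stop both scans and get
// swapped, which wastes swaps but keeps the two halves balanced.
int hoare_partition(std::vector<int>& a, int lo, int hi)    // [lo, hi] inclusive
{
    int pivot = a[lo + (hi - lo) / 2];
    int i = lo - 1, j = hi + 1;
    while (true) {
        do { ++i; } while (a[i] < pivot);
        do { --j; } while (a[j] > pivot);
        if (i >= j) return j;
        std::swap(a[i], a[j]);
    }
}
```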
sorting an array of pointers to objects
Merge sort does more moves but fewer compares than quick sort. If sorting an array of pointers to objects, only the pointers are being moved, but comparing objects requires dereferencing the pointers and doing whatever is needed to compare the objects. In this case, and in other cases where a compare takes more time than a move, merge sort is faster.
large datasets that don't fit in memory
For datasets too large to fit in memory, an in-memory sort is used to sort "chunks" of the dataset that will fit into memory, which are then written to external storage. The "chunks" on external storage are then merged using a k-way merge to produce a sorted dataset.
https://en.wikipedia.org/wiki/External_sorting
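As a rough illustration of the merge step (simplified to in-memory chunks rather than files on disk), a k-way merge with a min-heap might look like this in C++:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// k-way merge of already-sorted "chunks" using a min-heap, the same idea
// used in external sorting once each chunk has been sorted in memory.
// In a real external sort the chunks would be files read sequentially.
std::vector<int> kway_merge(const std::vector<std::vector<int>>& chunks)
{
    using Item = std::pair<int, std::pair<std::size_t, std::size_t>>; // value, (chunk, index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;

    for (std::size_t c = 0; c < chunks.size(); ++c)
        if (!chunks[c].empty())
            heap.push({chunks[c][0], {c, 0}});

    std::vector<int> out;
    while (!heap.empty()) {
        auto [value, pos] = heap.top();
        heap.pop();
        out.push_back(value);
        auto [c, i] = pos;
        if (i + 1 < chunks[c].size())              // advance within that chunk
            heap.push({chunks[c][i + 1], {c, i + 1}});
    }
    return out;
}
```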
I was trying to figure out which sorting algorithm (merge/quick) has better time and memory efficiency as the input size increases, so I wrote code that generates a list of random numbers and sorts the list with both algorithms. The program generates 5 txt files that record the random numbers with lengths of 1M, 2M, 3M, 4M and 5M (M stands for millions), and I got the following results.
[Results omitted: tables and charts of execution time in seconds and memory usage in KB for each input size.]
If you want the code, here is the GitHub repo: https://github.com/Nikodimos/Merge-and-Quick-sort-algorithm-using-php
In my scenario, merge sort becomes more efficient as the file size increases.
In addition to rcgldr's detailed response, I would like to underscore some extra considerations:
"large" and "small" are quite relative: in many libraries, small arrays (with fewer than 30 to 60 elements) are usually sorted with insertion sort. This algorithm is simpler and optimal if the array is already sorted, albeit with quadratic complexity in the worst case (a sketch of such an insertion sort appears after this list).
in addition to space and time complexities, stability is a feature that may be desirable, even necessary in some cases. Both Merge Sort and Insertion Sort are stable (elements that compare equal remain in the same relative order), whereas it is very difficult to achieve stability with Quick Sort.
As you mentioned, Quick Sort has a worst case time complexity of O(N²) and libraries do not implement median of medians to curb this downside. Many just use median of 3 or median of 9 and some recurse naively on both branches, paving the way for stack overflow in the worst case. This is a major problem as datasets can be crafted to exhibit worst case behavior, leading to denial of service attacks, slowing or even crashing servers. This problem was identified by Doug McIlroy in his famous 1999 paper A Killer Adversary for Quicksort. Implementations are available and attacks have been perpetrated using this technique (cf this discussion).
Almost sorted arrays are quite common in practice and neither Quick sort nor Merge sort treat them really efficiently. Libraries now use more advanced combinations of techniques such as Timsort to achieve better performance and stability.
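For completeness, here's what the small-range insertion sort mentioned above typically looks like (a generic hand-written version, not any specific library's):

```cpp
#include <cstddef>
#include <vector>

// Plain insertion sort, the kind typically used as the small-range fallback
// inside library sort implementations. Quadratic in the worst case, but very
// fast on tiny or nearly-sorted ranges.
void insertion_sort(std::vector<int>& a)
{
    for (std::size_t i = 1; i < a.size(); ++i) {
        int key = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) {   // shift larger elements right
            a[j] = a[j - 1];
            --j;
        }
        a[j] = key;
    }
}
```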

Time complexity of sequentially scanning an array vs a linked list?

Since the elements of an array are stored contiguously in memory, I understand that sequentially scanning all the elements of an array would be much faster than in a linked list of same size. For the former you only have to increment some index variable and then read the value at that index whereas for LL you have to read pointers and then fetch data at non-contiguous memory addresses. My question is specifically how we would categorise these two cases in terms of time complexity?
For scanning an array, does performing n random accesses, i.e. n O(1) operations, mean that overall it becomes O(n)? In that case wouldn't both be O(n)?
Sorry if this question doesn't make sense, maybe my understanding of time complexities isn't so good.
You are correct that
sequentially scanning a linked list or an array takes time O(n), and that
it's much faster to scan an array than a linked list due to locality of reference.
How can this be? The issue has to do with what you're counting with that O(n). Typically, when doing an analysis of an algorithm we assume that looking up a location in memory takes time O(1), and since in both cases you're doing O(n) memory lookups the total time is O(n).
However, the assumption that all memory lookups take the same amount of time is not a very good one in practice, especially with linked structures. You sometimes see analyses performed that do this in a different way. We assume that, somewhere in the machine, there's a cache that can hold B elements at any one time. Every time we do a memory lookup, if it's in cache, it's (essentially) free, and if it's not in cache then we do some work to load that memory address - plus the contents of memory around that location - into cache. We then only care about how many times we have to load something into the cache, since that more accurately predicts the runtime.
In the case of a linked list, where cells can be scattered randomly throughout memory, we'd expect to do Θ(n) memory transfers when scanning a linked list, since we basically will never expect to have a linked list cell already in cache. However, with an array, every time we find an array element not in cache, we pull into cache all the adjacent memory locations, which then means the next few elements will definitely be in the cache. This means that only (roughly) one in every B lookups will trigger a cache miss, so we expect to do Θ(n / B) memory transfers. This predicts theoretically what we see empirically - it's much faster to sequentially scan an array than a linked list.
So to summarize, this is really an issue of what you're measuring and how you measure it. If you just count memory accesses, both sequential accesses will require O(n) memory accesses and thus take O(n) time. However, if you just care about cache misses, then sequential access of a dynamic array requires Θ(n / B) transfers while a linked list scan requires Θ(n) transfers, which is why the linked list appears to be slower.
This sort of analysis is often used when designing data structures for databases. The B-tree (and its relative the B+-tree), which are used extensively in databases and file systems, are specifically tuned around the size of a disk page to reduce memory transfers. More recent work has been done to design "cache-oblivious" data structures that always take optimal advantage of the cache even without knowing its size.
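If you want to see the difference rather than just reason about it, a rough (and deliberately unscientific) C++ timing sketch like the one below usually shows the contiguous scan winning by a wide margin; the exact numbers depend entirely on your machine and its caches:

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <list>
#include <numeric>
#include <vector>

// Rough timing comparison of a sequential scan over a contiguous vector
// versus a node-based std::list of the same size. Both scans are O(n);
// the difference comes from cache behavior, not from asymptotics.
int main()
{
    const std::size_t n = 10'000'000;
    std::vector<int> vec(n, 1);
    std::list<int> lst(n, 1);

    auto time_sum = [](auto const& container) {
        auto start = std::chrono::steady_clock::now();
        long long sum = std::accumulate(container.begin(), container.end(), 0LL);
        auto stop = std::chrono::steady_clock::now();
        std::cout << "sum=" << sum << " took "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
                  << " ms\n";
    };

    time_sum(vec);   // contiguous: one cache miss covers several elements
    time_sum(lst);   // node-based: potentially one cache miss per node
}
```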
Unfortunately, you misunderstood how these things work.
Sequentially scanning all array elements is O(n), n being the size of the array, since you will visit each element. You need to calculate each address and then fetch the data, n times;
Sequentially scanning all linked list elements is O(n), n being the size of the linked list, since you will visit each element through the links;
Accessing one element of an array is O(1), since the access is just one memory address calculation and one fetch;
Accessing one element of a linked list is O(n), n being the position that you want to access, because you need to arrive at the nth element by hopping along each link until you reach the desired element.
Accessing the value at a certain index, let's say, 500, in an array is "immediate" (O(1)) while, with a linked list, you must iterate over 500 nodes to get the wanted one (O(n)).
Therefore, with an array, an index at the beginning or at the end of the container is accessible at the same speed, while with a linked list, the higher the index, the more time it takes to get to it.
Conversely, inserting a node into a linked list is easy and fast, while doing the same in an array is slower.
So the question becomes: which is the more common operation, accessing indices (writing, reading) or manipulating the container structure (inserting, removing)? The answer may seem obvious, but there are cases where it's not.
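A tiny C++ illustration of that access difference (purely for demonstration):

```cpp
#include <iostream>
#include <iterator>
#include <list>
#include <vector>

int main()
{
    std::vector<int> vec(1000, 0);
    std::list<int> lst(1000, 0);

    // Array/vector: index 500 is a single address computation, O(1).
    int a = vec[500];

    // Linked list: std::advance walks 500 links one by one, O(n).
    auto it = lst.begin();
    std::advance(it, 500);
    int b = *it;

    std::cout << a + b << '\n';
}
```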

Pros/Cons of using STL lists and vectors, linked lists and arrays

I'm wondering what the difference is between STL:list, STL:vector, array and linked list on a basic level in comparison to one another.
My understanding is that, generally, linked lists allow for growable lists, and inserts and deletes are much easier, but it takes longer to directly access a single element in a linked list since you would need to traverse the elements before it.
I am probably missing many other key differences so you could also point some more obvious ones out as well.
How do lists and vectors come into play in comparison and when would you choose one over the other?
Below are some differences between lists and vectors.
Time taken for insertion: Lists take constant time to insert an element, whereas a vector internally needs to relocate the data it already holds before inserting a new value if its capacity is equal to the number of elements it currently contains. This adds processing overhead and takes time.
Time taken to access the data: Vectors have the upper hand over lists in this respect. A vector takes constant time to access an element in the middle, whereas a list needs to be traversed to reach the required element.
Memory taken by the container: The capacity and size of a vector need not be the same. The capacity of a vector grows geometrically, in turn consuming more memory than the container actually needs at the moment. A list allocates exactly one node per element, so no unused capacity is ever reserved (although each node carries the overhead of its link pointers).
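A short C++ snippet illustrating those three points (insertion, access, and capacity vs. size):

```cpp
#include <iostream>
#include <iterator>
#include <list>
#include <vector>

int main()
{
    std::vector<int> vec = {1, 2, 4, 5};
    std::list<int>   lst = {1, 2, 4, 5};

    // Insertion in the middle: the vector shifts everything after the
    // insertion point (and may reallocate); the list just relinks nodes.
    vec.insert(vec.begin() + 2, 3);
    lst.insert(std::next(lst.begin(), 2), 3);

    // Random access: O(1) for the vector, O(n) traversal for the list.
    std::cout << "vec[2] = " << vec[2] << '\n';

    // Capacity vs. size: the vector may have reserved more storage than it
    // currently uses; the list allocates exactly one node per element.
    std::cout << "size = " << vec.size()
              << ", capacity = " << vec.capacity() << '\n';
}
```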

The time it takes to initialize a linked list vs an array

For a project we have to optimize code that calculates a certain prime number. My question is: in terms of time, how much faster or slower is it to initialize a linked list compared to an array? Obviously with smaller amounts of data the difference is negligible, but we are working with 100,000,000 data points, so in this case it does make a difference.
If you allocate ALL 100,000,000 points at once, array allocation is a single, constant time operation.
For linked lists, if you're allocating ALL numbers at once, then your lookups will be prohibitively slow.
If optimizing for space: use the linked list, since it will never be any larger than necessary, and deletions can easily free up memory. If optimizing for time, the array will be faster for lookups.
If you have to choose quickly, then the array is probably the way to go. I've never seen a complex mathematical algorithm implemented in a linked list. Linked lists are usually just good for learning the fundamentals of memory, computation, and pointers -- they don't seem to have much algorithmic value.
Finally, with collection types, you might get the best of both worlds - the initial time will be small for initialization, but once the collection gets to a given size, there might be a moment where it has to reallocate some new memory.
Practically speaking
I rarely find that a linked list or a raw array is all that necessary ... Most jobs do a lot better with an abstract collection type (i.e. a vector). But I am surmising that this is a homework assignment... Maybe consider adding that tag to the post?
An array will be far, far faster than a linked list to initialise (with data or without). But if you're removing and/or inserting data in the middle or at the beginning of the array (very often, or at all, depending on the size), the speed you gained from the initialisation will be lost moving elements around to fill the hole left by removing an item or to open a new hole for a new item.
1) If you initialize your ArrayList with a capacity of 100,000,000 elements, all the required memory will be allocated immediately and the array never has to expand, so it will be faster.
2) With a linked list, more memory is required for the same number of elements, as there is overhead for the next pointers. So initializing the list will also take longer.
In terms of how much faster the ArrayList is, you can do a simple experiment in Java and figure it out.
Edit: I'm sorry, I assumed an ArrayList instead of an array. But point #2 still applies.
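For what it's worth, the C++ equivalent of point #1 would be reserving the vector's capacity up front; a rough sketch (sized as in the question, so shrink n if you actually want to run it):

```cpp
#include <cstddef>
#include <vector>

int main()
{
    const std::size_t n = 100'000'000;   // as in the question; reduce to run comfortably

    // Reserving up front allocates the whole block once, so the vector never
    // has to grow and copy elements while the data points are appended.
    std::vector<double> points;
    points.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        points.push_back(static_cast<double>(i));
}
```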

Array-Based vs List-Based Stacks and Queues

I'm trying to compare the growth rates (both run-time and space) for stack and queue operations when implemented as both arrays and as linked lists. So far I've only been able to find average case run-times for queue pop()s, but nothing that comprehensively explores these two data structures and compares their run-times/space behaviors.
Specifically, I'm looking to compare push() and pop() for both queues and stacks, implemented as both arrays and linked lists (thus 2 operations x 2 structures x 2 implementations, or 8 values).
Additionally, I'd appreciate best, average and worst case values for both of these, and anything relating to the amount of space they consume.
The closest thing I've been able to find is this "mother of all cs cheat sheets" pdf that is clearly a masters- or doctoral-level cheat sheet of advanced algorithms and discrete functions.
I'm just looking for a way to determine when and where I should use an array-based implementation vs. a list-based implementation for both stacks and queues.
There are multiple different ways to implement queues and stacks with linked lists and arrays, and I'm not sure which ones you're looking for. Before analyzing any of these structures, though, let's review some important runtime considerations for the above data structures.
In a singly-linked list with just a head pointer, the cost to prepend a value is O(1) - we simply create the new element, wire its pointer to point to the old head of the list, then update the head pointer. The cost to delete the first element is also O(1), which is done by updating the head pointer to point to the element after the current head, then freeing the memory for the old head (if explicit memory management is performed). However, the constant factors in these O(1) terms may be high due to the expense of dynamic allocations. The memory overhead of the linked list is usually O(n) total extra memory due to the storage of an extra pointer in each element.
In a dynamic array, we can access any element in O(1) time. We can also append an element in amortized O(1), meaning that the total time for n insertions is O(n), though the actual time bounds on any insertion may be much worse. Typically, dynamic arrays are implemented by having most insertions take O(1) by appending into preallocated space, but having a small number of insertions run in Θ(n) time by doubling the array capacity and copying elements over. There are techniques to try to reduce this by allocating extra space and lazily copying the elements over (see this data structure, for example). Typically, the memory usage of a dynamic array is quite good - when the array is completely full, for example, there is only O(1) extra overhead - though right after the array has doubled in size there may be O(n) unused elements allocated in the array. Because allocations are infrequent and accesses are fast, dynamic arrays are usually faster than linked lists.
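To show where that amortized O(1) append comes from, here's a toy (not production-quality) dynamic array using the doubling strategy described above:

```cpp
#include <cstddef>
#include <memory>
#include <utility>

// Toy dynamic array showing the doubling strategy behind amortized O(1)
// append: most pushes write into preallocated space, and occasionally a
// push pays O(n) to copy everything into a buffer twice as large.
class IntArray {
public:
    void push_back(int value) {
        if (size_ == capacity_) grow();
        data_[size_++] = value;
    }
    int operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return size_; }

private:
    void grow() {
        std::size_t new_cap = capacity_ ? capacity_ * 2 : 1;
        std::unique_ptr<int[]> bigger(new int[new_cap]);
        for (std::size_t i = 0; i < size_; ++i)    // the occasional O(n) copy
            bigger[i] = data_[i];
        data_ = std::move(bigger);
        capacity_ = new_cap;
    }

    std::unique_ptr<int[]> data_;
    std::size_t size_ = 0;
    std::size_t capacity_ = 0;
};
```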
Now, let's think about how to implement a stack and a queue using a linked list or dynamic array. There are many ways to do this, so I will assume that you are using the following implementations:
Stack:
Linked list: As a singly-linked list with a head pointer.
Array: As a dynamic array
Queue:
Linked list: As a singly-linked list with a head and tail pointer.
Array: As a circular buffer backed by an array.
Let's consider each in turn.
Stack backed by a singly-linked list. Because a singly-linked list supports O(1) time prepend and delete-first, the cost to push or pop into a linked-list-backed stack is also O(1) worst-case. However, each new element added requires a new allocation, and allocations can be expensive compared to other operations.
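A minimal hand-rolled sketch of such a linked-list stack (illustrative names, not a library class):

```cpp
#include <memory>
#include <utility>

// Minimal stack backed by a singly linked list with only a head pointer:
// push and pop each touch just the head node, so both are O(1) worst case,
// but every push pays for one heap allocation.
class ListStack {
public:
    void push(int value) {
        auto node = std::make_unique<Node>();
        node->value = value;
        node->next = std::move(head_);
        head_ = std::move(node);
    }
    int pop() {                            // precondition: stack is not empty
        int value = head_->value;
        head_ = std::move(head_->next);
        return value;
    }
    bool empty() const { return head_ == nullptr; }

private:
    struct Node {
        int value = 0;
        std::unique_ptr<Node> next;
    };
    std::unique_ptr<Node> head_;
};
```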
Stack backed by a dynamic array. Pushing onto the stack can be implemented by appending a new element to the dynamic array, which takes amortized O(1) time and worst-case O(n) time. Popping from the stack can be implemented by just removing the last element, which runs in worst-case O(1) (or amortized O(1) if you want to try to reclaim unused space). In other words, the most common implementation has best-case O(1) push and pop, worst-case O(n) push and O(1) pop, and amortized O(1) push and O(1) pop.
Queue backed by a singly-linked list. Enqueuing into the linked list can be implemented by appending to the back of the singly-linked list, which takes worst-case time O(1). Dequeuing can be implemented by removing the first element, which also takes worst-case time O(1). This also requires a new allocation per enqueue, which may be slow.
Queue backed by a growing circular buffer. Enqueuing into the circular buffer works by inserting something at the next free position in the circular buffer. This works by growing the array if necessary, then inserting the new element. Using a similar analysis for the dynamic array, this takes best-case time O(1), worst-case time O(n), and amortized time O(1). Dequeuing from the buffer works by removing the first element of the circular buffer, which takes time O(1) in the worst case.
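And a rough sketch of the growing circular-buffer queue described above (again, just illustrative):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Queue backed by a growing circular buffer. Enqueue is amortized O(1):
// it writes at the next free slot and only occasionally pays O(n) to grow.
// Dequeue just advances the head index, so it is O(1) in the worst case.
class ArrayQueue {
public:
    void enqueue(int value) {
        if (count_ == buf_.size()) grow();
        buf_[(head_ + count_) % buf_.size()] = value;
        ++count_;
    }
    int dequeue() {                        // precondition: queue is not empty
        int value = buf_[head_];
        head_ = (head_ + 1) % buf_.size();
        --count_;
        return value;
    }
    bool empty() const { return count_ == 0; }

private:
    void grow() {
        std::vector<int> bigger(buf_.empty() ? 4 : buf_.size() * 2);
        for (std::size_t i = 0; i < count_; ++i)   // unwrap into the new buffer
            bigger[i] = buf_[(head_ + i) % buf_.size()];
        buf_ = std::move(bigger);
        head_ = 0;
    }

    std::vector<int> buf_;
    std::size_t head_ = 0;
    std::size_t count_ = 0;
};
```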
To summarize, all of the structures support pushing and popping n elements in O(n) time. The linked list versions have better worst-case behavior, but may have a worse overall runtime because of the number of allocations performed. The array versions are slower in the worst-case, but have better overall performance if the time per operation isn't too important.
These aren't the only ways you can implement lists. You could have an unrolled linked list, where each linked list cell holds multiple values. This slightly increases the locality of reference of the lookups and decreases the number of allocations used. Other options (using a balanced tree keyed by index, for example) represent a different set of tradeoffs.
Sorry if I misunderstood your question, but if I didn't, then I believe this is the answer you are looking for.
With a vector, you can only efficiently add/delete elements at the end of the container.
With a deque, you can efficiently add/delete elements at the beginning/end of the container.
With a list, you can efficiently insert/delete elements anywhere in the container.
vectors/deque allow for random access iterators.
lists only allow sequential access.
How you need to use and store the data is how you determine which is most appropriate.
EDIT:
There is a lot more to this, my answer is very generalized. I can go into more depth if I'm even on track of what your question is about.
