Searching for the first free index - c

I have a big array/list of 1 million IDs, and I need to find the first free ID that can be used. It can be assumed that a couple of modules refer to this data structure and take an ID (at which point it shall be marked as used) and then return it later (at which point it shall be marked as free).
I want to know what different data structures can be used, and what algorithms I can use to do this efficiently in time and in space (separately).
Please excuse me if this has already been asked here; I did search before posting.

One initial idea that might work would be to store a priority queue of all the unused IDs, sorted so that low IDs are dequeued before high IDs. Using a standard binary heap, this would make it possible to return an ID to the unused ID pool in O(log n) time and to find the next free ID in O(log n) time as well. This has the disadvantage that it requires you to explicitly store all of the IDs, which could be space-inefficient if there are a huge number of IDs.
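As a rough illustration, here is a minimal C sketch of that binary-heap idea, assuming IDs are plain ints, a fixed maximum pool size, and no error handling (the heap holds the free IDs and always pops the lowest one):

```c
/* Minimal sketch (assumptions: IDs are ints, pool size is fixed, no error handling). */
#include <stdio.h>

#define MAX_IDS 1000000

static int heap[MAX_IDS];
static int heap_size;

/* return an ID to the pool: O(log n) sift-up */
static void free_id(int id)
{
    int i = heap_size++;
    heap[i] = id;
    while (i > 0 && heap[(i - 1) / 2] > heap[i]) {
        int p = (i - 1) / 2, tmp = heap[p];
        heap[p] = heap[i];
        heap[i] = tmp;
        i = p;
    }
}

/* take the lowest free ID: O(log n) sift-down */
static int take_id(void)
{
    int min = heap[0], i = 0;
    heap[0] = heap[--heap_size];
    for (;;) {
        int l = 2 * i + 1, r = 2 * i + 2, s = i;
        if (l < heap_size && heap[l] < heap[s]) s = l;
        if (r < heap_size && heap[r] < heap[s]) s = r;
        if (s == i) break;
        int tmp = heap[s]; heap[s] = heap[i]; heap[i] = tmp;
        i = s;
    }
    return min;
}

int main(void)
{
    for (int id = 3; id >= 1; id--)            /* seed the pool with IDs 1..3 */
        free_id(id);
    printf("%d %d\n", take_id(), take_id());   /* 1 2 */
    free_id(1);
    printf("%d\n", take_id());                 /* 1 again */
    return 0;
}
```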
One potential space-saving optimization would be to try to coalesce consecutive ID values into ID ranges. For example, if you have free IDs 1, 3, 4, 5, 6, 8, 9, 10, and 12, you could just store the ranges 1, 3-6, 8-10, and 12. This would require you to change the underlying data structure a bit. Rather than using a binary heap, you could use a balanced binary search tree which stores the ranges. Since these ranges won't overlap, you can compare the ranges as less than, equal to, or greater than other ranges. Since BSTs are stored in sorted order, you can find the first free ID by taking the minimum element of the tree (in O(log n) time) and looking at the low end of its range. You would then update the range to exclude that first element, which might require you to remove an empty range from the tree. When returning an ID to the pool of unused IDs, you could do a predecessor and successor search to determine the ranges that come immediately before and after the ID. If either one of them could be extended to include that ID, you can just extend the range. (You might need to merge two ranges as well.) This also only takes O(log n) time.
Hope this helps!

A naive but efficient method would be to store all your IDs in a stack.
Getting an ID is a constant-time operation: pop the top item off the stack.
When the task is over, just push the ID back onto the stack.
If the lowest free ID must be returned (and not just any free ID), you can use a min-heap with insert and pop-lowest in O(log N).
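Here is a tiny C sketch of the stack variant, assuming IDs 0..N-1, a stack that starts full, and no underflow checks; note that after frees it hands out IDs in LIFO order, not necessarily the lowest:

```c
/* Tiny sketch of the stack-of-free-IDs approach (assumptions: IDs 0..N-1,
   the stack starts full, no underflow checks). */
#include <stdio.h>

#define N 1000000

static int stack[N];
static int top;                    /* number of IDs currently on the stack */

static int  take_id(void)   { return stack[--top]; }   /* O(1) pop  */
static void free_id(int id) { stack[top++] = id; }     /* O(1) push */

int main(void)
{
    for (int id = N - 1; id >= 0; id--)    /* seed: push high IDs first   */
        free_id(id);                       /* so low IDs come off first   */

    int a = take_id();                     /* 0 */
    int b = take_id();                     /* 1 */
    free_id(a);                            /* 0 is free again and on top  */
    printf("%d %d %d\n", a, b, take_id()); /* 0 1 0 */
    return 0;
}
```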

Try using a linked list of IDs. Link up all the nodes and have the head point to the first free one (let's say at init all are free). Whenever an ID is marked as used, remove its node from the front, place it at the end of the list, and make the head point to the next free node. In this way, your list stays structured "from free to used", and you can get a free ID in O(1). Also, when an ID is marked as free, put it back as the first member of the linked list (since it has become free, it is usable again), i.e. make the head point to that node. Hope this helps!
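A simplified C sketch of this idea, assuming IDs 0..N-1 and keeping only the free IDs linked (via an index-based next array), rather than also moving used nodes to the tail:

```c
/* Sketch of the linked-list idea as an index-based free list
   (assumptions: IDs 0..N-1, next_free[] links free IDs, -1 terminates). */
#include <stdio.h>

#define N 1000000

static int next_free[N];
static int head;                       /* first free ID, or -1 if none */

static void init(void)                 /* at init, all IDs are free */
{
    for (int id = 0; id < N - 1; id++)
        next_free[id] = id + 1;
    next_free[N - 1] = -1;
    head = 0;
}

static int take_id(void)               /* O(1): unlink the head */
{
    int id = head;
    head = next_free[id];
    return id;
}

static void free_id(int id)            /* O(1): put it back at the front */
{
    next_free[id] = head;
    head = id;
}

int main(void)
{
    init();
    int a = take_id();                       /* 0 */
    int b = take_id();                       /* 1 */
    free_id(a);                              /* 0 becomes the head again */
    printf("%d %d %d\n", a, b, take_id());   /* 0 1 0 */
    return 0;
}
```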

Preamble: a binary heap seems the best answer indeed. I'll present here an alternative that may have advantages in some scenarios.
One possible way is to use a Fenwick tree. You store in each position either 0 or 1, indicating whether that position is already used or not, and you can find the first empty position with a binary search (find the first prefix [1..n] whose sum is n-1). The complexity of this operation is O(log^2 n), which is worse than a binary heap, but this approach has other advantages (a minimal sketch follows the list below):
You can implement a Fenwick Tree in less than 10 lines of code
You can now calculate the density (number of used / total ids) of a range in O(log n)
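A minimal C sketch of the Fenwick-tree approach, assuming IDs 1..MAX_ID, where a stored 1 means "used" and the first free ID is found by binary-searching the prefix sums (O(log^2 n)):

```c
/* Minimal sketch (assumptions: IDs are 1..MAX_ID, 1 = used, 0 = free).
   Fenwick tree over the used/free bits; the first free ID is the smallest n
   with prefix_sum(n) == n - 1, found by binary search. */
#include <stdio.h>

#define MAX_ID 1000000

static int bit[MAX_ID + 1];          /* Fenwick tree, 1-indexed */

static void update(int i, int delta) /* mark ID i used (+1) or free (-1) */
{
    for (; i <= MAX_ID; i += i & -i)
        bit[i] += delta;
}

static int prefix_sum(int i)         /* number of used IDs in 1..i */
{
    int s = 0;
    for (; i > 0; i -= i & -i)
        s += bit[i];
    return s;
}

static int first_free(void)          /* O(log^2 n): binary search on prefix sums */
{
    int lo = 1, hi = MAX_ID, ans = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (prefix_sum(mid) < mid) {  /* some ID in 1..mid is still free */
            ans = mid;
            hi = mid - 1;
        } else {
            lo = mid + 1;
        }
    }
    return ans;                       /* -1 if every ID is used */
}

int main(void)
{
    int id = first_free();            /* 1 at start: nothing is used yet */
    update(id, +1);                   /* take it */
    printf("took ID %d, next free is %d\n", id, first_free());
    update(id, -1);                   /* return it */
    return 0;
}
```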

If you do not strictly need the lowest ID, you can allocate IDs to modules in batches of 1000. When IDs are freed, they can be added to the back of the list, and once in a while you would sort the list to make sure that the IDs you hand out are again from the low end.

Well, an array probably isn't the best structure. A hash would be better, speed-wise at least. As for the structure of each "node", all I can see you need is just the ID and whether it is being used or not.


Why is looking for an item in a hash map faster than looking for an item in an array?

You might have come across the claim that it is faster to find elements in a hashmap/dictionary/table than in a list/array. My question is WHY?
(My inference so far: why should it be faster? As far as I can see, in both data structures it has to travel through the elements until it reaches the required one.)
Let’s reason by analogy. Suppose you want to find a specific shirt to put on in the morning. I assume that, in doing so, you don’t have to look at literally every item of clothing you have. Rather, you probably do something like checking a specific drawer in your dresser or a specific section of your closet and only look there. After all, you’re not (I hope) going to find your shirt in your sock drawer.
Hash tables are faster to search than lists because they employ a similar strategy - they organize data according to the principle that every item has a place it “should” be, then search for the item by just looking in that place. Contrast this with a list, where items are organized based on the order in which they were added and where there isn’t a particular pattern as to why each item is where it is.
More specifically: one common way to implement a hash table is with a strategy called chained hashing. The idea goes something like this: we maintain an array of buckets. We then come up with a rule that assigns each object a bucket number. When we add something to the table, we determine which bucket number it should go to, then jump to that bucket and then put the item there. To search for an item, we determine the bucket number, then jump there and only look at the items in that bucket. Assuming that the strategy we use to distribute items ends up distributing the items more or less evenly across the buckets, this means that we won’t have to look at most of the items in the hash table when doing a search, which is why the hash table tends to be much faster to search than a list.
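As a rough illustration, here is a minimal chained-hashing sketch in C, assuming integer keys, a fixed bucket count, and no resizing or deletion:

```c
/* Minimal chained-hashing sketch (assumptions: integer keys, fixed bucket
   count, no resizing or deletion). Each bucket is a singly linked list;
   a lookup only scans the one bucket the key hashes to. */
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 101

struct node {
    int key;
    struct node *next;
};

static struct node *buckets[NBUCKETS];

static unsigned bucket_of(int key)
{
    return (unsigned)key % NBUCKETS;      /* the "rule" that assigns a bucket */
}

static void insert(int key)
{
    struct node *n = malloc(sizeof *n);
    n->key = key;
    n->next = buckets[bucket_of(key)];    /* push onto that bucket's chain */
    buckets[bucket_of(key)] = n;
}

static int contains(int key)
{
    for (struct node *n = buckets[bucket_of(key)]; n; n = n->next)
        if (n->key == key)
            return 1;                     /* only this chain is ever scanned */
    return 0;
}

int main(void)
{
    insert(42);
    insert(143);                          /* collides with 42 (143 % 101 == 42) */
    printf("%d %d\n", contains(42), contains(7));   /* prints: 1 0 */
    return 0;
}
```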
For more details on this, check out these lecture slides on hash tables, which fill in more of the details about how this is done.
Hope this helps!
To understand this, you can think of how the elements are stored in these data structures.
A HashMap/Dictionary, as you know, is a key-value data structure. To store an element, you first compute the hash value of the key (a hash function maps a key to a bucket index; for example, a simple hash function can be made with the modulo operation). Then you basically put the value against this hashed key.
In a list, you basically keep appending elements to the end. The order of insertion matters in this data structure. The memory allocated to this data structure is not contiguous.
An array you can think of as similar to a list, but in this case the memory allocated is contiguous. So if you know the address of the first index, you can find the address of the nth element.
Now think of the retrieval of the element from these Data structures:
From HashMap/Dictionary: when you search for an element, the first thing you do is compute the hash value for the key. Once you have that, you go to the corresponding bucket and obtain the value. In this approach, the amount of work performed is (on average) constant; in asymptotic notation this is O(1).
From List: you literally need to iterate through each element and check whether it is the one you are looking for. In the worst case, your desired element might be at the end of the list, so the amount of work varies, and in the worst case you have to iterate over the whole list. In asymptotic notation this is O(n), where n is the number of elements in the list.
From array: to find an element in the array, what you need to know is the address of the first element. For any other element, you can do the math of how far it is from the first index.
For example, let's say the address of the first element is 100 and each element takes 4 bytes of memory. The element you are looking for is at the 3rd position, so its address is 108. The math used is:
Address of first element + (position of element - 1) * memory used per element.
That is 100 + (3 - 1) * 4 = 108.
In this case too, as you can observe, the work performed to reach an element is always constant. In asymptotic notation this is O(1).
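A tiny C illustration of that address arithmetic (the concrete addresses and sizes above are just for the example; they will differ on a real machine):

```c
/* Tiny sketch of the address arithmetic described above
   (assumption: a plain int array). */
#include <stdio.h>

int main(void)
{
    int a[5] = {10, 20, 30, 40, 50};
    char *base = (char *)&a[0];                      /* "address of first element" */

    /* address of the 3rd element = base + (3 - 1) * sizeof(int) */
    int *third = (int *)(base + (3 - 1) * sizeof(int));

    printf("%d\n", *third);                          /* prints 30, same as a[2] */
    return 0;
}
```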
Now to compare: O(1) will always be faster than O(n), and hence retrieval of elements from a HashMap/Dictionary or an array will be faster than from a list.
I hope this helps.

Use of memory between an array and a linked list

In C, which is more efficient in terms of memory management, a linked list or an array?
For my program, I could use one or both of them. I would like to take this point into consideration before starting.
Both linked lists and arrays have good and bad sides.
Array
Accessing a particular position takes O(1) time, because the memory allocated for an array is contiguous. So if the address of the first position is A, then the address of the 5th element is A + 4 (in units of the element size).
If you want to insert a number at some position, it will take O(n) time, because you have to shift every number after that position and also increase the size of the array (see the sketch after this list).
As for searching for an element: if the array is sorted, you can do a binary search, and since accessing each position is O(1), the search costs only the binary search, O(log n). If the array is not sorted, you have to traverse the entire array, so O(n) time.
Deletion is the exact opposite of insertion: you have to left-shift all the numbers starting from the place where you deleted, and you might also need to recreate the array for memory efficiency. So O(n).
Memory must be contiguous, which can be a problem on old x86 machines with 64k segments.
Freeing is a single operation.
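To illustrate the O(n) insertion cost, a small C sketch that inserts into an array by shifting the tail with memmove (assuming the array has spare capacity and pos <= len):

```c
/* Sketch of inserting into an array at position pos by shifting the tail
   right (assumptions: arr has spare capacity, pos <= *len). */
#include <stdio.h>
#include <string.h>

static void array_insert(int *arr, int *len, int pos, int value)
{
    /* shift every element from pos onward one slot to the right: O(n) */
    memmove(&arr[pos + 1], &arr[pos], (*len - pos) * sizeof arr[0]);
    arr[pos] = value;
    (*len)++;
}

int main(void)
{
    int a[8] = {1, 2, 4, 5};
    int len = 4;

    array_insert(a, &len, 2, 3);       /* insert 3 before the 4 */
    for (int i = 0; i < len; i++)
        printf("%d ", a[i]);           /* 1 2 3 4 5 */
    printf("\n");
    return 0;
}
```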
Linked list
Accessing a particular position takes O(n) time, because you have to traverse the list to get to that position.
If you want to insert a number at some position and you have a pointer at that position already, it will take O(1) time to insert the new value.
As for searching for an element: no matter how the numbers are arranged, you have to traverse them from front to back one by one to find your particular number, so it's always O(n).
Deletion is the exact opposite of insertion. If you already know the position via some pointer - suppose the list is p->q->r and you want to delete q - all you need to do is set the next of p to r, and nothing else. So O(1) [given you know the pointer to p] (see the sketch after this list).
Memory is dispersed. With a naive implementation, that can be bad for cache locality, and the overall memory use can be high because the memory allocation system has overhead for each node. However, careful programming can get around this problem.
Freeing the whole list requires a separate call for each node; however, again, careful programming can get around this problem.
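And for comparison, a small C sketch of O(1) insert-after and delete-after on a singly linked list, given a pointer to the node just before the position of interest:

```c
/* Sketch of O(1) insert-after and delete-after on a singly linked list,
   given a pointer to the node just before the position of interest. */
#include <stdio.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* insert a new node right after p: no shifting, just pointer updates */
static void insert_after(struct node *p, int value)
{
    struct node *n = malloc(sizeof *n);
    n->value = value;
    n->next = p->next;
    p->next = n;
}

/* delete the node right after p (the "p->q->r, unlink q" case) */
static void delete_after(struct node *p)
{
    struct node *q = p->next;
    p->next = q->next;
    free(q);
}

int main(void)
{
    /* build 1->2->3 from heap nodes so delete_after can free() safely */
    struct node *p = malloc(sizeof *p);
    p->value = 1;
    p->next = NULL;
    insert_after(p, 3);               /* 1->3    */
    insert_after(p, 2);               /* 1->2->3 */

    delete_after(p);                  /* unlink the 2: 1->3 */
    for (struct node *n = p; n; n = n->next)
        printf("%d ", n->value);      /* 1 3 */
    printf("\n");
    return 0;
}
```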
So depending on what kind of problem you are solving you have to choose one of the two.
A linked list uses more memory, both in the linked list itself and inside the memory manager, due to the fact that you are allocating many individual blocks of memory.
That does not mean it is less efficient at all, depending on what you are doing.
While a linked list uses more memory, adding or removing elements is very efficient, as it doesn't require moving data around at all, whereas resizing a dynamic array means you have to allocate a whole new area in memory to fit the modified array with items added/removed. You can also sort a linked list without moving its data.
On the other hand, arrays can be substantially faster to iterate due to caching, prefetching, etc., as the data is placed sequentially in memory.
Which one is better for you will really depend on the application.

How to design inserting to an infinite array

Problem statement
Imagine we have an infinite array where we store integers. When n elements are in the array, only the first n cells are used; the rest are empty.
I'm trying to come up with a data structure / algorithm that is capable of:
checking whether an element is stored
inserting a new element if it is not already stored
deleting an element if it is stored
Each operation has to be in O(sqrt(n)).
Approach 1
I've come across this site, where the following algorithm was presented:
The array is (virtually, imagine this) divided into subarrays. Their lengths are 1, 4, 9, 16, 25, 36, 49, etc.; the last subarray may not be a full perfect square - it may not be filled entirely.
The assumption is that, when we consider those subarrays as sets, they are in increasing order, so all elements of a heap further to the right are greater than any element of the heaps to its left.
Each such subarray represents a binary heap. A max heap.
Lookup: go to the first index of each heap (so again 1, 4, 9, 16, ...) and scan until you find the first heap whose max (the max is stored at those indexes) is greater than your number. Then check that subarray/heap.
Insert: once you do the lookup, insert the element into the heap where it should be. When the heap is full, take the greatest element and insert it into the next heap. And so on.
Unfortunately, this solution is O(sqrt(n) * log(n)).
How to make it pure O(sqrt(n))?
Idea 2
Since all the operations require the lookup to be performed, I imagine that, beyond the lookup, inserting and deleting would both be O(1). Just a guess. And probably once inserting is figured out, deleting will be obvious.
Clarification
What does the infinite array mean?
Basically, you can store any number of elements in it. It is infinite. However, there are two restrictions. First, one cell can only store one element. Second, when the array currently stores n elements, only the first n cells can be used.
What about the order?
It does not matter.
Have you considered a bi-parental heap (aka: BEAP)?
The heap maintains a height of sqrt(n), which means that insert, find, and remove all run in O(sqrt(n)) in the worst case.
These structures are described in Munro and Suwanda's 1980 paper Implicit data structures for fast search and update.
Create a linked list of k arrays which represent hash tables.
Per the idea of the first site, let the hash tables be sized to contain 1, 4, 9, 16, 25, 36, 49, ... elements.
With k hash tables, the data structure therefore contains N = k*(k+1)*(2*k+1)/6 = O(k^3) elements (this is the well-known summation formula for adding squares).
You can then successively search each hash table for elements. The hash check, insert, and delete operations all work in O(1) time (assuming separate chaining so that deletions can be handled gracefully), and, since k<sqrt(N) (less than the cubic root, actually), this fulfills the time requirements of your algorithm.
If a hash table is full, add an additional one to the linked list. If a hash table is empty, remove it from the list (add it back in if necessary later). List insertion/deletion is O(1) for a doubly-linked list, so this does not affect the time complexity.
Note that this improves on other answers which suggest a straight-out hash table because rehashing will not be required as the data structure grows.
I think approach 1 works, I just think some of the math is wrong.
The number of subarrays is not O(sqrt(n)); it's O(n^(1/3)).
So you get O(log(n) * n^(1/3)) = O((log(n) / n^(1/6)) * n^(1/2)), and since lim(log(n) / n^(1/6)) = 0, we get O((log(n) / n^(1/6)) * n^(1/2)) < O(sqrt(n)).
My CS is a bit rusty, so you'll have to double check this. Please let me know if I got this wrong.
The short answer is that fulfilling all of your requirements is impossible for the simple fact that an array is a representation of elements ordered by index; and if you want to keep the first n elements referenced by the first n indexes, as you say, any deletion can potentially require re-indexing (that is shifting elements up the array) on the order of O(n) operations.
(That said, ignoring deletion, this was my earlier proposal: since your array is infinite, perhaps you won't mind if I bend one of the rules a little. Think of your array as similar to memory addresses in a computer, then build a balanced binary tree, assigning a block of array elements to each node (I'm not too experienced with trees, but I believe you'll need a block of four elements: two for the children, one for the value, and one for the height). The elements reserved for the children will simply contain the starting indexes in the array of the children's blocks (nodes). You will use 4n = O(n) instead of strictly n space for the first n elements (bending your rule a little), and have orders of magnitude better complexity, since the operations on a BST would be O(log2 n). Instead of assigning blocks of elements, node construction could also be done by dividing each array element into sections of bits, of which you would likely have enough in a theoretically infinite scenario.)
Since you are storing integers, just make the array 4 billion ints wide. Then, when you add an element, increment the counter at the index equal to that element by 1. Adding, removing, and checking for an element will each take O(1) time. It's basically just a hash table without the hash.
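A toy C sketch of this idea; to keep the demo small it assumes keys in 0..1000 rather than the full 4-billion range:

```c
/* Sketch of the "hash table without the hash" idea (assumption: keys are
   limited to 0..MAX_KEY here instead of the full 4-billion range). */
#include <stdio.h>

#define MAX_KEY 1000

static unsigned count[MAX_KEY + 1];        /* count[x] = how many times x is stored */

static void add(int x)      { count[x]++; }
static void rmv(int x)      { if (count[x]) count[x]--; }
static int  contains(int x) { return count[x] > 0; }

int main(void)
{
    add(7);
    add(7);
    rmv(7);
    printf("%d %d\n", contains(7), contains(8));   /* prints: 1 0 */
    return 0;
}
```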

What data structure to use to emulate an array in which one can add data in any position?

I want to store a small number of items (less than 255) which have constant size (a C char) and be able to do the following operations:
Insert a value at an arbitrary position and have the other items preserve their previous order.
Delete an item and have the other items preserve their order (as above).
Find the next and previous of an item.
I have tried using an array and writing a function that adds a value by moving all the items after it one place forward. The same thing can be done for deleting, but it is too inefficient. Of course, I do not mind having to use a library, as long as it is readily available and free.
Array - access: O(1), insert: O(n)
Doubly-linked list - access: O(n), previous/next: O(1), insert(*): O(1)
RB tree with the number of children stored: O(log n) for all operations.
(*): You need to traverse the list first to get to the position (O(n)).
Note: no, the array is not messy, it's really simple to implement. Also as you can see, depending on the usage, it can be quite efficient.
Based on the number of elements and your remark about the array implementation, you should stick to arrays.
You could use a doubly-linked list for it. However, this won't work if you want to keep the array behaviour (e.g. accessing elements quickly by their index: O(1) for an array, O(n) for a linked list).

How to preserve the order of elements of the same priority in a priority queue implemented as binary heap?

I have created a binary heap, which represents a priority queue. It's just the classical, well-known algorithm. This heap schedules a chronological sequence of different events (the sort key is time).
It supports 2 operations: Insert and Remove. Each node's key in the heap is greater than or equal to each of its children's keys. However, adding events with the same key doesn't preserve the order in which they were added, because each time Remove or Insert is called, the heap-up and heap-down procedures break the order.
My question is: what should be changed in a classical algorithm to preserve the order of the nodes with the same priority?
One solution is to add a time-of-insertion attribute to the inserted element. That may be just a simple counter, incremented each time a new element is inserted into the heap. Then, when two elements are equal by priority, compare their times of insertion.
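As a sketch (in C, with a hypothetical struct event), the tie-breaking comparison could look like the following; the priority field stands in for the event time and seq is taken from a global counter at insertion time:

```c
/* Minimal sketch of the tie-breaking comparison (assumptions: a hypothetical
   struct event with a priority and a seq field; seq is assigned from a
   global counter whenever an element is inserted). */
#include <stdint.h>
#include <stdio.h>

struct event {
    int      priority;   /* the real sort key, e.g. the event time */
    uint64_t seq;        /* insertion counter, assigned once per Insert */
};

static uint64_t next_seq;            /* incremented on every insertion */

/* nonzero if a should come out of the queue before b; this assumes a
   min-ordered queue (earliest first) - flip the comparison for a max-heap */
static int comes_before(const struct event *a, const struct event *b)
{
    if (a->priority != b->priority)
        return a->priority < b->priority;   /* normal priority order */
    return a->seq < b->seq;                 /* equal priority: FIFO by seq */
}

int main(void)
{
    struct event a = { 5, next_seq++ };     /* inserted first  */
    struct event b = { 5, next_seq++ };     /* inserted second, same priority */

    printf("%d\n", comes_before(&a, &b));   /* 1: ties break in insertion order */
    return 0;
}
```

In the heap itself, Insert would set e->seq = next_seq++ and then use comes_before() in the heap-up and heap-down procedures.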
As far as I know, heaps are never built to preserve order (which is why "heap sort" is notable for not being a stable sort).
I understand that what you are asking is whether a small algorithmic trick might be able to change this (one that is not the good old reliable "timestamp" solution). I don't think it's possible.
What I would have suggested is some version of this:
keep the same "insert";
modify "remove" so that it ensures a certain order on elements of a given priority.
To do this, in heap-down, instead of swapping elements down only until the heap order is restored, swap an element down until it is at the end of a subtree of elements of the same value, always choosing to go to the right when you can.
Unfortunately the problem with this is that you don't know where insert will add an element of a given priority: it could end up anywhere in the tree. Changing this would be, I believe, more than just a tweak to the structure.
If the elements are inserted in chronological order and this order is maintained (for example by having "append" rather than "insert" and "remove_and_pack" rather than just "remove") you could use the memory address (cast to an unsigned 32- or 64-bit integer depending on environment) of the element as the final comparison step. Early elements will have numerically lower addresses than later ones.
