Optimal method for handling a changing array in Fortran - arrays

Let's say I have a 2D array. Along the first axis I have a series of properties for one individual measurement. Along the second axis I have a series of measurements.
So, for example, the array could look something like this:
         personA  personB  personC
height   1.8      1.75     2.0
weight   60.5     82.0     91.3
age      23.1     65.8     48.5
or anything similar.
I want to change the size of the array very often - for example, ignoring personB's data and including personD and personE. I will be looping through "time", probably with >10^5 timesteps. Each timestep, there is a chance that each "person" in the array could be deleted and a chance that they will introduce several new people into the simulation.
From what I can see there are several ways to manage an array like this:
Overwriting and infrequent reallocation
I could use a very large array with an extra row, in which I put a "skip" flag. So, if I decide I no longer need personB, I set the flag to 1 and ignore personB every time I loop through the list of people. When I need to add personD, I search through the list for the first person with skip == 1, replace the data with the data for personD, and set skip = 0. If there aren't any people with skip == 1, I copy the array, deallocate it, reallocate it with several more columns, and then fill the first new column with personD's data.
Advantages:
infrequent allocation - possibly better performance
easy access to array elements
easier to optimise
Disadvantages:
if my array shrinks a lot, I'll be wasting a lot of memory
I need a whole extra row in the data, and I have to perform checks to make sure I don't use the irrelevant data. If the array shrinks from 1000 people to 1, I'm going to have to loop through 999 extra records
could encounter memory issues, if I have a very large array to copy
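The skip-flag idea can be sketched roughly as follows. This is written in C rather than Fortran, only to match the code used elsewhere on this page, and it stores one record per person instead of a 2D array. The names `Person`, `Pool`, `pool_add` and `pool_remove` are made up for illustration, and doubling the capacity is one possible growth policy, not the only one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical record: one "person" with the three properties from the example. */
typedef struct {
    double height, weight, age;
    bool   skip;              /* true = slot is free and must be ignored */
} Person;

typedef struct {
    Person *slots;
    size_t  capacity;         /* number of allocated slots */
} Pool;

/* Store p in the first free slot, growing the pool only when none is free.
   Returns the slot index, or (size_t)-1 if memory could not be obtained. */
size_t pool_add(Pool *pool, Person p)
{
    assert(pool != NULL);
    p.skip = false;
    for (size_t i = 0; i < pool->capacity; i++) {
        if (pool->slots[i].skip) {
            pool->slots[i] = p;
            return i;
        }
    }
    /* No free slot: double the capacity (the infrequent reallocation). */
    size_t old_cap = pool->capacity;
    size_t new_cap = old_cap ? old_cap * 2 : 4;
    Person *grown = realloc(pool->slots, new_cap * sizeof *grown);
    if (grown == NULL)
        return (size_t)-1;
    for (size_t i = old_cap; i < new_cap; i++)
        grown[i].skip = true;             /* mark the new slots as free */
    pool->slots = grown;
    pool->capacity = new_cap;
    pool->slots[old_cap] = p;
    return old_cap;
}

/* Deleting a person only sets the flag; no data is moved. */
void pool_remove(Pool *pool, size_t i)
{
    assert(pool != NULL && i < pool->capacity);
    pool->slots[i].skip = true;
}
```

A pool starts out as `Pool pool = {NULL, 0};`. The linear search for a free slot is the simplest choice; keeping a separate stack of free indices would avoid it at the cost of extra bookkeeping.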
Frequent reallocation
Every time I want to add or remove some data, I copy and reallocate the entire array.
Advantages:
I know every piece of data in the array is relevant, so I don't have to check them
easy access to array elements
no wasted memory
easier to optimise
Disadvantages:
probably slow
could encounter memory issues, if I have a very large array to copy
A linked list
I refactor everything so that each individual's data includes a pointer to the next individual's data. Then, when I need to delete an individual I simply remove it from the pointer chain and deallocate it, and when I need to add an individual I just add some pointers to the end of the chain.
Advantages:
every record in the chain is relevant, so I don't have to do any checks
no wasted memory
less likely to encounter memory problems, as I don't have to copy the entire array at once
Disadvantages:
no easy access to array elements. I can't slice with data(height,:), for example
more difficult to optimise
I'm not sure how this option will perform compared to the other two.
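For comparison, the linked-list option might look something like this in C (again not Fortran; the names `Node`, `list_push`, `list_remove_if` and the sample predicate `older_than_60` are hypothetical). The sketch adds new individuals at the head of the chain rather than the end, because that needs no traversal.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node {
    double height, weight, age;
    struct Node *next;        /* next individual in the chain, or NULL */
} Node;

/* Add a new individual at the head of the chain. Returns 0, or -1 on failure. */
int list_push(Node **head, double height, double weight, double age)
{
    assert(head != NULL);
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->height = height;
    n->weight = weight;
    n->age    = age;
    n->next   = *head;
    *head     = n;
    return 0;
}

/* Example predicate (an assumption, not from the question). */
int older_than_60(const Node *n)
{
    return n->age > 60.0;
}

/* Unlink and free every node the predicate selects, in one pass.
   Returns the number of nodes removed. */
size_t list_remove_if(Node **head, int (*should_delete)(const Node *))
{
    assert(head != NULL && should_delete != NULL);
    size_t removed = 0;
    Node **link = head;                   /* points at the pointer to patch */
    while (*link != NULL) {
        Node *cur = *link;
        if (should_delete(cur)) {
            *link = cur->next;            /* bypass the node ... */
            free(cur);                    /* ... and release it */
            removed++;
        } else {
            link = &cur->next;
        }
    }
    return removed;
}

/* Release the whole chain. */
void list_free(Node **head)
{
    assert(head != NULL);
    while (*head != NULL) {
        Node *next = (*head)->next;
        free(*head);
        *head = next;
    }
}
```

Each removal touches only one pointer, but every node is a separate allocation, which is where the "more difficult to optimise" disadvantage comes from.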
--
So, to the questions: are there other options? When should I use each of these options? Is one of these options better than all of the others in my case?

Related

Is it possible to implement a dynamic array without reallocation?

The default way to implement dynamic arrays is to use realloc. Once len == capacity we use realloc to grow our array. This can cause copying of the whole array to another heap location. I don't want this copying to happen, since I'm designing a dynamic array that should be able to store large amount of elements, and the system that would run this code won't be able to handle such a heavy operation.
Is there a way to achieve that?
I'm fine with losing some performance - O(logN) for search instead of O(1) is okay. I was thinking that I could use a hashtable for this, but it looks like I'm in a deadlock, since in order to implement such a hashtable I would need a dynamic array in the first place.
Thanks!
Not really, not in the general case.
The copy happens when the memory manager can't increase the current allocation in place and needs to move the memory block somewhere else.
One thing you can try is to allocate fixed-size blocks and keep a dynamic array of pointers to the blocks. This way the blocks never need to be reallocated, so the large payloads stay in place. If you need to reallocate, you only reallocate the array of references, which should be much cheaper (moving 8 bytes per block instead of 1 MB or more). The ideal block size is about sqrt(N), so this does not work well in the fully general case: any fixed size will be too large for some values of N and too small for others.
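A minimal sketch of that block-plus-directory layout, assuming `int` elements and an arbitrary block size of 1024 (`ChunkedArray`, `chunked_push`, `chunked_at` and `chunked_free` are illustrative names):

```c
#include <assert.h>
#include <stdlib.h>

#define BLOCK_SIZE 1024       /* elements per block; an arbitrary choice */

typedef struct {
    int   **blocks;           /* directory: one pointer per allocated block */
    size_t  nblocks;          /* blocks allocated so far */
    size_t  dir_cap;          /* capacity of the directory itself */
    size_t  len;              /* elements stored */
} ChunkedArray;

/* Append a value. Only the small directory is ever reallocated;
   existing blocks stay where they are. Returns 0, or -1 on failure. */
int chunked_push(ChunkedArray *a, int value)
{
    assert(a != NULL);
    if (a->len == a->nblocks * BLOCK_SIZE) {      /* all blocks are full */
        if (a->nblocks == a->dir_cap) {           /* directory is full too */
            size_t cap = a->dir_cap ? a->dir_cap * 2 : 4;
            int **dir = realloc(a->blocks, cap * sizeof *dir);
            if (dir == NULL)
                return -1;
            a->blocks  = dir;
            a->dir_cap = cap;
        }
        int *block = malloc(BLOCK_SIZE * sizeof *block);
        if (block == NULL)
            return -1;
        a->blocks[a->nblocks++] = block;
    }
    a->blocks[a->len / BLOCK_SIZE][a->len % BLOCK_SIZE] = value;
    a->len++;
    return 0;
}

/* O(1) access: one division picks the block, the remainder picks the slot. */
int *chunked_at(const ChunkedArray *a, size_t i)
{
    assert(a != NULL && i < a->len);
    return &a->blocks[i / BLOCK_SIZE][i % BLOCK_SIZE];
}

void chunked_free(ChunkedArray *a)
{
    assert(a != NULL);
    for (size_t b = 0; b < a->nblocks; b++)
        free(a->blocks[b]);
    free(a->blocks);
    a->blocks = NULL;
    a->nblocks = a->dir_cap = a->len = 0;
}
```

Because elements never move, a pointer obtained from `chunked_at` stays valid while the array grows, which is not true of a `realloc`-based array.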
If you are not against small allocations, you could use a singly linked list of tables, where each new table doubles the capacity and becomes the front of the list.
If you want to access the last element, you can just get the value from the last allocated block, which is O(1).
If you want to access the first element, you have to run through the list of allocated blocks to get to the correct block. Since the length of each block is two times the previous one, this means the access complexity is O(logN).
This data structure relies on the same principle that dynamic arrays use (doubling the size of the array when expanding), but instead of copying the values after allocating a new block, it keeps track of the previous block. Accessing the earlier blocks therefore adds overhead, while accessing the last one does not.
The index is not a position in a specific block, but in an imaginary concatenation of all the blocks, starting from the first allocated block.
Thus, this data structure cannot be implemented as a recursive type alone, because it needs a wrapper keeping track of the total capacity to compute which block is referred to.
For example:
There are three blocks, of sizes 100, 200, 400.
Accessing the 150th value (index 149 if starting from 0) means accessing the 50th value of the second block. The interface needs to know that the total length is 700, and compares the index to 700 - 400 to determine whether it refers to the last block (if the index is 300 or above) or to a previous block.
Then the interface compares the index with the start of the previous block (300 - 200 = 100) and knows that the 150th value is in the second block.
This algorithm can have as many iterations as there are blocks, which is O(logN).
Again, if you only try to access the last value, the complexity becomes O(1).
If you have concerns about copy times for real time applications or large amounts of data, this data structure could be better than having a contiguous storage and having to copy all of your data in some cases.
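The lookup described in the example can be sketched like this. The types `Block` and `BlockList` and the function `locate` are hypothetical names, and allocation of the blocks is left out so that only the index arithmetic is shown.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Block {
    size_t        capacity;   /* number of slots in this block */
    struct Block *prev;       /* block allocated before this one, or NULL */
    int          *data;       /* storage for the slots (not used by locate) */
} Block;

/* The wrapper that tracks the total capacity of all blocks. */
typedef struct {
    Block *last;              /* most recently allocated (largest) block */
    size_t total_capacity;    /* sum of all block capacities */
} BlockList;

/* Find the block holding a global index and store the position inside
   that block in *offset. Walks at most one step per block. */
Block *locate(const BlockList *list, size_t index, size_t *offset)
{
    assert(list != NULL && offset != NULL);
    assert(index < list->total_capacity);
    size_t start = list->total_capacity;
    for (Block *b = list->last; b != NULL; b = b->prev) {
        start -= b->capacity;             /* global index of b's first slot */
        if (index >= start) {
            *offset = index - start;
            return b;
        }
    }
    return NULL;   /* not reached if total_capacity matches the blocks */
}
```

With blocks of 100, 200 and 400 slots, index 149 is first compared with 300 (not in the last block), then with 100, giving offset 49 in the second block.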
I ended up with the following:
Implement "small dynamic array" that can grow, but only up to some maximum capacity (e.g. 4096 words).
Implement an rbtree
Combine them together to make a "big hash map", where "small array" is used as a table and a bunch of rbtrees are used as buckets.
Use this hashmap as a base for a "big dynamic array", using indexes as keys
While the capacity is less than the maximum capacity, the table grows according to the load factor. Once the capacity reaches the maximum, the table won't grow anymore, and new elements are just inserted into buckets. In theory this structure should work with O(log(N/k)) complexity, where k is the number of buckets.

Array VS single linked list VS double link list

I am currently learning about arrays, singly linked lists and doubly linked lists, and this question came up:
"What is the best option between these three data structures when it comes to fast searching, less memory, easy insertion and updating of things?"
As far as I know, an array cannot be the answer because it has a fixed size, so inserting a new item would not always be possible. A doubly linked list can do the task, but each node needs two pointers, so it uses more memory. So I think a singly linked list fulfils all the given requirements. Am I right? Please correct me if I am missing any point. One more question: instead of choosing one of them, can I combine two or more of the data structures given here to meet all the requirements?
"What is the best option between these three data structures when it comes to fast searching, less memory, easily insertion and updating of things".
As far as I can tell Arrays serve the purpose.
Fast search: You could do a binary search if the array is sorted. You don't get that option in a linked list.
Less memory: Arrays will take least memory (but contiguous memory )
Insertion: Inserting in array is a matter of a[i] = "value". If array size is exceeded then simply export data into a new array. That is exactly how HashMaps / ArrayLists work under covers.
Updating things: Only Arrays provide you with Random access. a[i] ="new value".. updated in O(1) time if you know the index.
Each of those has its own benefits and downsides.
For search speed, I'd say arrays are better suited due to the quick lookup times.
Since an array is a sequence of same-size elements, retrieving the value at an index is just memoryLocation + index * elementSize. For a linked list, the whole list needs traversing.
Arrays also win in the "less memory" category, since there's no need to store extra pointers.
For insertions, arrays are slow. You'll need to traverse the array, copy contents to a new array, assign the new array, delete the old one...
Insertions go much quicker in singly or doubly linked lists, because it's just a matter of changing one or two pointers.
In the end, it all just depends on the use case. Are you inserting a lot? Then you probably want to consider a non-array structure.
Do you need many quick lookups? Consider those arrays again. Etc..
See also this question.
A linked list is usually the best choice when we don’t know in advance the number of elements we will have to store or the number can change dynamically.
Arrays have slow insertion and deletion times. To insert an element to the front or middle of the array, the first step is to ensure that there is space in the array for the new element, otherwise, the array needs to be RESIZED. This is an expensive operation. The next step is to open space for the new element by shifting every element after the desired index. Likewise, for deletion, shifting is required after removing an element. This implies that insertion time for arrays is Big O of n (O(n)) as n elements must be shifted.
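The shifting can be made concrete with a small sketch. The functions `insert_at` and `delete_at` are illustrative names; the array is assumed to have a fixed capacity, and resizing is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Insert value at position pos in arr[0..*len), shifting the tail right.
   Returns 0, or -1 if the array is full and would need resizing first. */
int insert_at(int *arr, size_t *len, size_t capacity, size_t pos, int value)
{
    assert(arr != NULL && len != NULL && pos <= *len);
    if (*len == capacity)
        return -1;
    for (size_t i = *len; i > pos; i--)   /* open a gap: up to n moves */
        arr[i] = arr[i - 1];
    arr[pos] = value;
    (*len)++;
    return 0;
}

/* Delete the element at pos, shifting the tail left to close the gap. */
void delete_at(int *arr, size_t *len, size_t pos)
{
    assert(arr != NULL && len != NULL && pos < *len);
    for (size_t i = pos; i + 1 < *len; i++)
        arr[i] = arr[i + 1];
    (*len)--;
}
```

Inserting at the front moves every element, while inserting at the end moves none, which is why appending to an array with spare capacity is cheap.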
Using static arrays, we can save some memory in comparison to linked lists, because we do not need to store pointers to the next node.
Doubly linked lists support fast insertion and removal at both ends. This is used in an LRU cache, where you need to add a new item at the front and remove the oldest item from the end.

Is it more efficient to use a linked list and delete nodes or use an array and do a small computation to a string to see if element can be skipped?

I am writing a program in C that reads a file. Each line of the file is a string of characters on which a computation will be done. The result of the computation on a particular string may imply that strings later on in the file do not need any computations done on them. Also, if the reverse of the string comes in alphabetical order before the (current, non-reversed) string, then it does not need to be checked.
My question is would it be better to put each string in a linked list and delete each node after finding particular strings don’t need to be checked or using an array and checking the last few characters of a string and if it is alphabetically after the string in the previous element skip it? Either way the list or array only needs to be iterated through once.
A rule of thumb is that if you are dealing with small objects (< 32 bytes), std::vector is better than a linked list for most general operations.
But for larger objects (say, 1 KB), you generally need to consider lists.
There is an article with a detailed comparison that you can check here:
http://www.baptiste-wicht.com/2012/11/cpp-benchmark-vector-vs-list/3/
Without further details about your needs, it is a bit difficult to tell you which one fits your requirements better.
Arrays are easy to access, especially if you are going to do it in a non-sequential way, but they are hard to maintain if you need to perform deletions on them or if you don't have a good approximation of the final number of elements.
Lists are good if you plan to access them sequentially, but terrible if you need to jump between its elements. Also deletion over them can be done in constant time if you are already in the node you want to delete.
I don't quite understand how you plan to access them since you say that either one would be iterated just once, but if that is the case then either structure would give you the similar performance since you are not really taking advantage of their key benefits.
It's really difficult to understand what you are trying to do, but it sounds like you should create an array of records, with each record holding one of your strings and a boolean flag to indicate whether it should be processed.
You set each record's flag to true as you load the array from the file.
You use one pointer to scan the array once, processing only the strings from records whose flags are still true.
For each record processed, you use a second pointer to scan from the first pointer + 1 to the end of the array, identify strings that won't need processing (in light of the current string), and set their flags to false.
-Al.
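The flag-and-two-pointers scan from the answer above could be sketched as follows. The question does not say what makes a later string unnecessary, so `makes_redundant` uses a made-up rule (the later string starts with the current one) purely as a placeholder. `Record` and `process_records` are illustrative names too.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    const char *text;
    bool        active;       /* true = still needs to be processed */
} Record;

/* Placeholder rule (an assumption): a later string is redundant
   if it begins with the current string. */
static bool makes_redundant(const char *current, const char *later)
{
    return strncmp(current, later, strlen(current)) == 0;
}

/* Scan the records once. For each active record, clear the flags of the
   later records it makes redundant. Returns how many were processed. */
size_t process_records(Record *recs, size_t n)
{
    assert(recs != NULL || n == 0);
    size_t processed = 0;
    for (size_t i = 0; i < n; i++) {
        if (!recs[i].active)
            continue;                     /* flagged off earlier: skip */
        processed++;                      /* the real computation goes here */
        for (size_t j = i + 1; j < n; j++) {
            if (recs[j].active && makes_redundant(recs[i].text, recs[j].text))
                recs[j].active = false;
        }
    }
    return processed;
}
```

Nothing is deleted or moved; skipped strings cost one flag test each during the outer scan.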

Fast way to remove bytes from a buffer

Is there a faster way to do this:
Vector3* points = malloc(maxBufferCount*sizeof(Vector3));
//put content into the buffer and increment bufferCount
...
// remove one point at index `removeIndex`
bufferCount--;
for (int j = removeIndex; j < bufferCount; j++) {
    points[j] = points[j+1];
}
I'm asking because I have a huge buffer from which I remove elements quite often.
No, sorry - removing elements from the middle of an array takes O(n) time. If you really want to modify the elements often (i.e. remove certain items and/or add others), use a linked list instead - that has constant-time removal and addition once you hold a pointer to the node in question. In contrast, arrays have constant lookup time, while linked lists can only be accessed (read) in linear time. So decide what you will do more frequently (reading or writing) and choose the appropriate data structure based upon that decision.
Note, however, that I (kindly) assumed you are not trying to commit the crime of premature optimization. If you haven't benchmarked that this is the bottleneck, then probably just don't worry about it.
Unless you know it's a bottleneck you can probably let the compiler optimize for you, but you could try memmove.
The selected answer here is pretty comprehensive: When to use strncpy or memmove?
A description is here: http://www.kernel.org/doc/man-pages/online/pages/man3/memmove.3.html
A few things to say. The memmove function will probably copy faster than your loop; it is often optimised by the writers of your particular compiler or C library to use special instructions which aren't available in the C language without inline assembler. I believe these are called SIMD instructions (Single Instruction, Multiple Data)? Somebody correct me if I am wrong.
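As a rough sketch, the removal loop from the question could be replaced by a single memmove call. The layout of `Vector3` is assumed here (three floats), and `remove_point` is a hypothetical helper name.

```c
#include <assert.h>
#include <string.h>

typedef struct { float x, y, z; } Vector3;    /* assumed layout */

/* Remove the point at removeIndex by sliding the tail down one slot.
   memmove is used (not memcpy) because source and destination overlap. */
void remove_point(Vector3 *points, size_t *count, size_t removeIndex)
{
    assert(points != NULL && count != NULL && removeIndex < *count);
    memmove(&points[removeIndex],
            &points[removeIndex + 1],
            (*count - removeIndex - 1) * sizeof points[0]);
    (*count)--;
}
```

This is still O(n) per removal; it only lowers the constant factor, so it is worth measuring before and after.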
If you can save up the items to be removed, then you can optimise by sorting the list of items you wish to remove and then doing a single pass. It isn't hard, but it takes some fiddly index arithmetic.
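One way the single pass might look, assuming the indices to remove are already sorted in ascending order, contain no duplicates, and are all in range (`remove_sorted` is an illustrative name, and `int` elements are used for brevity):

```c
#include <assert.h>
#include <stddef.h>

/* Remove the elements whose indices are listed in doomed[0..ndoomed).
   doomed must be sorted ascending, without duplicates, all < len.
   Each surviving element is moved at most once. Returns the new length. */
size_t remove_sorted(int *arr, size_t len, const size_t *doomed, size_t ndoomed)
{
    assert(arr != NULL || len == 0);
    assert(doomed != NULL || ndoomed == 0);
    size_t write = 0;                     /* next slot to fill */
    size_t d = 0;                         /* next index to drop */
    for (size_t read = 0; read < len; read++) {
        if (d < ndoomed && doomed[d] == read) {
            d++;                          /* skip this element */
            continue;
        }
        arr[write++] = arr[read];         /* keep: slide it down */
    }
    return write;
}
```

Removing k elements this way costs one O(n) pass in total, instead of k separate O(n) shifts.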
Also, you could just store each item in a linked list. Removing an item is then trivial, but you lose random access to your array.
Finally you can have an additional array of pointers, the same size of your array, each pointer pointing to an element. Then you can access the array through double indirection, you can sort the array by swapping pointers, and you can delete items by making their pointer NULL.
Hope this gives you some ideas. There usually is a way to optimise things, but then it becomes more application specific.

Why are linked lists faster than arrays?

I am very puzzled about this. Everywhere there is written "linked lists are faster than arrays" but no one makes the effort to say WHY. Using plain logic I can't understand how a linked list can be faster. In an array all cells are next to each other so as long as you know the size of each cell it's easy to reach one cell instantly. For example if there is a list of 10 integers and I want to get the value in the fourth cell then I just go directly to the start of the array+24 bytes and read 8 bytes from there.
On the other hand, when you have a linked list and you want to get the element in the fourth place, then you have to start from the beginning or end of the list (depending on whether it's a singly or doubly linked list) and go from one node to the next until you find what you're looking for.
So how the heck can going step by step be faster than going directly to an element?
This question title is misleading.
It asserts that linked lists are faster than arrays without limiting the scope well. There are a number of times when arrays can be significantly faster and there are a number of times when a linked list can be significantly faster: the particular case of linked lists "being faster" does not appear to be supported.
There are two things to consider:
The theoretical bounds of linked-lists vs. arrays in a particular operation; and
the real-world implementation and usage pattern including cache-locality and allocations.
As far as access to an indexed element goes: the operation is O(1) in an array and, as pointed out, is very fast (just an offset). The operation is O(k) in a linked list (where k is the index and may always be << n, depending), but if the linked list is already being traversed then this is O(1) per step, which is "the same" as an array. Whether an array traversal (for(i=0;i<len;i++)) is faster (or slower) depends upon the particular implementation/language/run-time.
However, if there is a specific case where the array is not faster for either of the above operations (seek or traversal), it would be interesting to see it dissected in more detail. (I am sure it is possible to find a language with a very degenerate implementation of arrays over lists *cough* Haskell *cough*)
Happy coding.
My simple usage summary: Arrays are good for indexed access and operations which involve swapping elements. The non-amortized re-size operation and extra slack (if required), however, may be rather costly. Linked lists amortize the re-sizing (and trade slack for a "pointer" per-cell) and can often excel at operations like "chopping out or inserting a bunch of elements". In the end they are different data-structures and should be treated as such.
Like most problems in programming, context is everything. You need to think about the expected access patterns of your data, and then design your storage system appropriately. If you insert something once, and then access it 1,000,000 times, then who cares what the insert cost is? On the other hand, if you insert/delete as often as you read, then those costs drive the decision.
Depends on which operation you are referring to. Adding or removing elements is a lot faster in a linked list than in an array.
Iterating sequentially over the list one by one is more or less the same speed in a linked list and an array.
Getting one specific element in the middle is a lot faster in an array.
And the array might waste space, because very often when expanding the array, more elements are allocated than needed at that point in time (think ArrayList in Java).
So you need to choose your data structure depending on what you want to do:
many insertions and iterating sequentially --> use a LinkedList
random access and ideally a predefined size --> use an array
Because no memory is moved when an insertion is made in the middle of a linked list.
For the case you presented, it's true - arrays are faster: you need only arithmetic to go from one element to another. Linked lists require indirection and fragment memory.
The key is to know what structure to use and when.
Linked lists are preferable over arrays when:
a) you need constant-time insertions/deletions from the list (such as in real-time computing where time predictability is absolutely critical)
b) you don't know how many items will be in the list. With arrays, you may need to re-declare and copy memory if the array grows too big
c) you don't need random access to any elements
d) you want to be able to insert items in the middle of the list (such as a priority queue)
Arrays are preferable when:
a) you need indexed/random access to elements
b) you know the number of elements in the array ahead of time so that you can allocate the correct amount of memory for the array
c) you need speed when iterating through all the elements in sequence. You can use pointer math on the array to access each element, whereas in a linked list you need to look up each node through the pointer stored in the previous one, which may cause cache misses or page faults and therefore performance hits.
d) memory is a concern. Filled arrays take up less memory than linked lists. Each element in the array is just the data. Each linked list node requires the data as well as one (or more) pointers to the other elements in the linked list.
Array Lists (like those in .Net) give you the benefits of arrays, but dynamically allocate resources for you so that you don't need to worry too much about list size, and you can delete items at any index without having to re-shuffle the elements yourself (the list shifts them internally). Performance-wise, array lists are slower than raw arrays.
Reference:
Lamar's answer:
https://stackoverflow.com/a/393578/6249148
LinkedList is Node-based meaning that data is randomly placed in memory and is linked together by nodes (objects that point to another, rather than being next to one another)
Array is a set of similar data objects stored in sequential memory locations
The advantage of a linked list is that data doesn’t have to be sequential in memory. When you add/remove an element, you are simply changing the pointer of a node to point to a different node, not actually moving elements around. If you only add and access elements near the head of the list, and not towards the end, then access is fast because you iterate over fewer elements. There are also variations on the LinkedList, such as a DoublyLinkedList, in which each node points to both the previous and the next node.
The advantage of an array is that yes you can access any element O(1) time if you know the index, but if you don’t know the index, then you will have to iterate over the data.
The downside of an array is the fact that its data is stored sequentially in memory. If you want to insert an element at index 1, then you have to move every element after it to the right. Also, the array has to keep resizing itself as it grows, basically copying itself in order to make a new array with a larger capacity. If you want to remove an element at the beginning, then you will have to move all the other elements to the left.
Arrays are good when you know the index, but are costly as they grow.
The reason why people talk highly about linked lists is because the most useful and efficient data structures are node based.

Resources