Numpy concatenate is slow: any alternative approach? - arrays

I am running the following code:
for i in range(1000):
    My_Array = numpy.concatenate((My_Array, New_Rows[i]), axis=0)
The above code is slow. Is there any faster approach?

This is basically what is happening in all algorithms based on arrays.
Each time you change the size of the array, it needs to be resized and every element needs to be copied. This is happening here too. (Some implementations reserve some empty slots, e.g. doubling the internal memory each time the array grows.)
If you have your data at np.array creation time, just add it all at once (memory will be allocated only once then!).
If not, collect the pieces with something like a linked list (allowing O(1) append operations). Then read it into your np.array at once (again only one memory allocation).
This is not much of a numpy-specific topic, but much more about data structures.
Edit: as this quite vague answer got some upvotes, I feel the need to make clear that my linked-list approach is one possible example. As indicated in the comment, Python's lists are more array-like (and definitely not linked lists). But the core fact is: list.append() in Python is fast (amortized O(1)) while that's not true for numpy arrays! There is also a small part about the internals in the docs:
How are lists implemented?
Python’s lists are really variable-length arrays, not Lisp-style linked lists. The implementation uses a contiguous array of references to other objects, and keeps a pointer to this array and the array’s length in a list head structure.
This makes indexing a list a[i] an operation whose cost is independent of the size of the list or the value of the index.
When items are appended or inserted, the array of references is resized. Some cleverness is applied to improve the performance of appending items repeatedly; when the array must be grown, some extra space is allocated so the next few times don’t require an actual resize.
(bold annotations by me)
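The over-allocation described in the quoted FAQ entry can be observed from Python itself. The following is a minimal sketch, assuming CPython, where sys.getsizeof on a list reflects the currently allocated capacity; the helper name count_resizes is made up for this illustration, and the exact number of resizes depends on the interpreter version.

```python
import sys

def count_resizes(n):
    """Append n items to a list and count how often its allocated size changes."""
    items = []
    last_size = sys.getsizeof(items)
    resizes = 0
    for i in range(n):
        items.append(i)
        size = sys.getsizeof(items)
        if size != last_size:  # the allocated capacity changed on this append
            resizes += 1
            last_size = size
    return resizes

resizes = count_resizes(10_000)
print(f"10000 appends triggered {resizes} resizes")
```

Only a small fraction of the appends should trigger a resize, which is why list.append() is amortized O(1), whereas every numpy.concatenate call allocates and copies a whole new array.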

Maybe create an empty array of the correct size and then populate it?
If you have a list l of arrays with the same dimensions, you could do:
import numpy as np
arr = np.zeros((len(l),) + l[0].shape)
for i, v in enumerate(l):
    arr[i] = v
This works much faster for me, as it requires only one memory allocation.

It depends on what New_Rows[i] is, and what kind of array you want. If you start with lists (or 1d arrays) that you want to join end to end (to make a long 1d array), just concatenate them all at once. np.concatenate takes a list of any length, not just 2 items.
np.concatenate(New_Rows, axis=0)
or maybe use an intermediate list comprehension (for more flexibility)
np.concatenate([row for row in New_Rows])
or closer to your example.
np.concatenate([New_Rows[i] for i in range(1000)])
But if New_Rows elements are all the same length, and you want a 2d array, one New_Rows value per row, np.array does a nice job:
np.array(New_Rows)
np.array([i for i in New_Rows])
np.array([New_Rows[i] for i in range(1000)])
np.array is designed primarily to build an array from a list of lists.
np.concatenate can also build in 2d, but the inputs need to be 2d to start with. vstack and stack can take care of that. But all those stack functions use some sort of list comprehension followed by concatenate.
In general it is better/faster to iterate or append with lists, and apply np.array (or concatenate) just once. Appending to a list is fast; much faster than making a new array.

I think @thebeancounter's solution is the way to go.
If you do not know the exact size of your numpy array ahead of time, you can also take an approach similar to how the std::vector class is implemented in C++.
To be more specific, you can wrap the numpy ndarray into a new class which has a default size which is larger than your current needs. When the numpy array is almost fully populated, copy the current array to a larger one.
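A minimal sketch of such a wrapper is shown below. The class name GrowableArray, its methods, the default capacity and the doubling factor are all assumptions made for this illustration, not part of numpy; this version grows when the buffer is completely full rather than "almost" full.

```python
import numpy as np

class GrowableArray:
    """Hypothetical wrapper: a 2D array that doubles its row capacity when full."""

    def __init__(self, n_cols, capacity=16, dtype=float):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._data = np.empty((capacity, n_cols), dtype=dtype)
        self._size = 0  # number of rows actually in use

    def append(self, row):
        if self._size == self._data.shape[0]:
            # Full: allocate twice the capacity and copy the used rows once.
            bigger = np.empty((2 * self._data.shape[0], self._data.shape[1]),
                              dtype=self._data.dtype)
            bigger[:self._size] = self._data[:self._size]
            self._data = bigger
        self._data[self._size] = row
        self._size += 1

    def view(self):
        """Return the filled part of the buffer (a view, not a copy)."""
        return self._data[:self._size]

buf = GrowableArray(n_cols=3, capacity=2)
for i in range(1000):
    buf.append([i, i + 1, i + 2])
result = buf.view()
print(result.shape)
```

With doubling, 1000 appends starting from a capacity of 2 need only nine reallocations, instead of one full copy per appended row.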

Assume you have a large list of 2D numpy arrays, with the same number of columns and different numbers of rows, like this:
x = [numpy_array1(r_1, c),......,numpy_arrayN(r_n, c)]
Concatenate them pairwise like this:
while len(x) != 1:
    if len(x) == 2:
        x = np.concatenate((x[0], x[1]))
        break
    # Merge neighbouring pairs. With an odd count, the last array sits at an
    # even index and is carried over unchanged by the slice below.
    for i in range(0, len(x) - 1, 2):
        x[i] = np.concatenate((x[i], x[i+1]))
    x = x[::2]

Related

How to blit from a 1D array along a dimension of a 2D array?

I have a 2D array, and have computed necessary updates along a given dimension of it using a 1D array (said updates can't be computed in place as earlier calculations would override values needed in later calculations). I thus want to copy the updates into my 2D array. The most obvious way to do this would, at first glance, appear to be to use Array slicing and Array.blit.
I have tried the approach of extracting the relevant dimension using array slicing, and then blitting across to that, but that doesn't update the values inside the 2D array. I think what is happening is that a new, separate, 1D array is being created when I make the slice, and the values are being blitted into that new array, which of course is dropped a moment later when it goes back out of scope.
I suppose you could say that I was expecting the slicing to return a view into the 2D array which would work for the blit function call, but instead the slicing actually returns a new array with the values copied into it (which, thinking about it, is what slicing does otherwise, I believe).
Currently I am using a workaround whereby I create a 2D array, where one of the dimensions is only 1 element wide (thus effectively re-creating a 1D array), and then using Array2D.blit. I would prefer to do it directly though, both because I find this ugly, and moreover because it would be quite useful elsewhere in my program where I can't just declare a 1D array as 2D.
My first approach:
let srcArray = Array.zeroCreate srcArrayLength
... // do relevant computation
srcArray.[index] <- result
... // finish computation
Array.blit srcArray 0 destArray.[index, *] 0 srcArrayLength
My current approach:
let srcArray = Array2D.zeroCreate 1 srcArrayLength
... // do relevant computation
srcArray.[0,index] <- result
... // finish computation
Array2D.blit srcArray 0 0 destArray index 0 1 srcArrayLength
The former approach has no effect on my destination 2D array. The latter approach works where I use it, but as I said above it isn't nice, and cannot be used in another situation, where I have a jagged 2D array (i.e. 'a[][]) that I would like to blit across from.
How might I go about achieving my aim? I thought of Span/Memory, but it wasn't clear to me if and how they could be used here. Alternatively, if you can spot a better way to do this that doesn't involve blit, I'm all-virtual-ears.
I figured out a fairly good solution to this, with the help of someone over in the F# Foundation Slack. Since nobody else has posted an answer, I'll put this one up.
Both Array.Copy (note that that is the .NET Array.Copy method, not the F#-specific Array.copy) and Buffer.BlockCopy were suggested to me. Array.Copy still complains about mismatching array types, but Buffer.BlockCopy ignores the dimensionality of the supplied array, and merely copies the specified number of bytes from one location to another. Using this and relying on the fact that 2D arrays are really stored as 1D arrays in row-major order (the same as C, I believe), it is quite possible to overwrite the last dimension of a multi-dimensional array reasonably cleanly.
I updated the code from the 'current approach' in my question to the below:
let srcArray = Array.zeroCreate srcArrayLength
... //do relevant computation
srcArray.[index] <- result
... //finish computation
Buffer.BlockCopy(srcArray, 0, destArray, firstDimIndex * lengthOfSecondDim * sizeof<'a>, lengthOfSecondDim * sizeof<'a>)
Not only does it do the job in a way which I personally find a bit tidier, but it has a side-benefit in that it is noticeably faster than the second approach described in the question - I haven't yet run a benchmark to quantify the difference though.
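The offset arithmetic behind this can be sketched in a few lines. The example below is in Python rather than F#, uses a plain list as a stand-in for the flat row-major storage, and counts offsets in elements instead of bytes; the function name blit_row is made up for the illustration.

```python
def blit_row(flat, n_cols, row_index, src):
    """Overwrite one row of a row-major 2D array stored in a flat list."""
    if len(src) != n_cols:
        raise ValueError("src must have exactly n_cols elements")
    start = row_index * n_cols          # offset of the row's first element
    flat[start:start + n_cols] = src    # same-length slice assignment, in place

n_rows, n_cols = 3, 4
dest = [0] * (n_rows * n_cols)   # a 3x4 array laid out row by row
blit_row(dest, n_cols, 1, [7, 8, 9, 10])
rows = [dest[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]
print(rows)
```

Multiplying start and n_cols by the element size in bytes gives the destination offset and byte count passed to Buffer.BlockCopy above.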

Array VS single linked list VS double link list

I am learning about arrays, singly linked lists and doubly linked lists at the moment, and this question came up:
"What is the best option between these three data structures when it comes to fast searching, less memory, easy insertion and updating of things?"
As far as I know, an array cannot be the answer because it has a fixed size: if we want to insert a new item, it won't always be possible. A doubly linked list can do the task, but each node needs two pointers, which costs memory, so I think a singly linked list fulfils all the given requirements. Am I right? Please correct me if I am missing any point. One more question: instead of choosing one of them, can I combine two or more of the data structures given here to meet all the requirements?
"What is the best option between these three data structures when it comes to fast searching, less memory, easily insertion and updating of things".
As far as I can tell, arrays serve the purpose.
Fast search: You can do a binary search if the array is sorted. You don't get that option with a linked list.
Less memory: Arrays take the least memory (but it must be contiguous memory).
Insertion: Inserting into an array is a matter of a[i] = "value". If the array's capacity is exceeded, simply copy the data into a new, larger array. That is exactly how HashMaps / ArrayLists work under the covers.
Updating things: Only arrays give you random access. a[i] = "new value" updates in O(1) time if you know the index.
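The search and update points above can be sketched with a Python list standing in for the array. The helper binary_search is a hypothetical name; bisect_left comes from the standard library.

```python
from bisect import bisect_left

def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if absent."""
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

data = [2, 3, 5, 7, 11, 13]
found = binary_search(data, 7)     # O(log n) thanks to random access
missing = binary_search(data, 4)
data[2] = 6                        # O(1) update at a known index (order is kept)
print(found, missing, data)
```

Binary search relies on jumping straight to the middle element, which is exactly the operation a linked list cannot do in constant time.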
Each of those has its own benefits and downsides.
For search speed, I'd say arrays are better suited due to the quick lookup times.
Since an array is a sequence of same-size elements, retrieving the value at an index is just memoryLocation + index * elementSize. For a linked list, the whole list needs traversing.
Arrays also win in the "less memory" category, since there's no need to store extra pointers.
For insertions, arrays are slow. You'll need to traverse the array, copy contents to a new array, assign the new array, delete the old one...
Insertions go much quicker in singly or doubly linked lists, because it's just a matter of changing one or two pointers.
In the end, it all just depends on the use case. Are you inserting a lot? Then you probably want to consider a non-array structure.
Do you need many quick lookups? Consider those arrays again. Etc..
See also this question.
A linked list is usually the best choice when we don’t know in advance the number of elements we will have to store or the number can change dynamically.
Arrays have slow insertion and deletion times. To insert an element to the front or middle of the array, the first step is to ensure that there is space in the array for the new element, otherwise, the array needs to be RESIZED. This is an expensive operation. The next step is to open space for the new element by shifting every element after the desired index. Likewise, for deletion, shifting is required after removing an element. This implies that insertion time for arrays is Big O of n (O(n)) as n elements must be shifted.
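The shifting step can be made explicit with a fixed-capacity buffer. This is a minimal sketch; the function name insert_with_shift and the convention of passing the used size separately are assumptions made for the illustration.

```python
def insert_with_shift(buffer, size, index, value):
    """Insert value at index in a fixed-capacity buffer holding `size` items.

    Returns the number of elements that had to be shifted.
    """
    if size >= len(buffer):
        raise OverflowError("buffer is full; it would have to be resized first")
    if not 0 <= index <= size:
        raise IndexError("index out of range")
    # Walk backwards so each element moves one slot right without being overwritten.
    for i in range(size, index, -1):
        buffer[i] = buffer[i - 1]
    buffer[index] = value
    return size - index

buffer = [10, 20, 30, 40, None, None]  # capacity 6, 4 slots in use
shifted = insert_with_shift(buffer, 4, 0, 5)
print(buffer, shifted)
```

Inserting at the front shifts every stored element, which is the O(n) worst case described above; inserting at index == size shifts nothing.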
Using static arrays, we can save some memory in comparison to linked lists because we do not need to store pointers to the next node.
A doubly linked list supports fast insertion/removal at both ends. This is used in an LRU cache, where you need to add a new item at the front and remove the oldest item from the end.
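A rough sketch of the LRU idea follows. It uses Python's collections.OrderedDict, which tracks order and offers cheap operations at both ends, as a stand-in for a hand-written doubly linked list plus hash map. The class name LRUCache and its methods are hypothetical, and this version keeps the most recently used entry at the end rather than at the front, which is just the mirrored convention.

```python
from collections import OrderedDict

class LRUCache:
    """Hypothetical minimal LRU cache; most recently used entries sit at the end."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)   # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" becomes the most recently used
cache.put("c", 3)     # evicts "b"
print(list(cache._items))
```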

Push_back element in Matlab Struct

In Matlab, I have a simple structure and I would like to build an array of this structure (I know how to do this). My question: is there a way to simply insert an element into that array without having to tell the array in which position it should go? Does something similar to C++'s "push_back" function, which simply puts your element at the end of the vector, exist in the Matlab language?
You can use indexing in conjunction with end
a_struct = struct('x', 1);
a_struct(end+1) = struct('x', 2); % this writes the element to the `end+1`'th-position
disp(a_struct)
Will give you:
1x2 struct array with fields:
x
Note though, that under the hood there's no preallocation whatsoever as there might be for c++ vectors etc.
So every assignment to end+1 will internally result in making a copy of the old structure with one additional element.
See e.g. http://blogs.mathworks.com/loren/2008/02/01/structure-initialization/#7 for comments on this.
It sounds like you want to iteratively extend the array (vector). This is very inefficient in MATLAB as it will lead to a large number of reallocations as the vector grows.
In MATLAB, it is better to allocate the vector in advance (of the correct size) and index it directly, or use arrayfun to construct the array.
This is similar to the issue with C++'s std::vector, where it is better to call reserve() once up front and then push_back() (or std::back_inserter, which calls push_back()) than to let repeated push_back() calls reallocate as the vector grows.

Why are linked lists faster than arrays?

I am very puzzled about this. Everywhere there is written "linked lists are faster than arrays" but no one makes the effort to say WHY. Using plain logic I can't understand how a linked list can be faster. In an array all cells are next to each other so as long as you know the size of each cell it's easy to reach one cell instantly. For example if there is a list of 10 integers and I want to get the value in the fourth cell then I just go directly to the start of the array+24 bytes and read 8 bytes from there.
On the other hand, when you have a linked list and you want to get the element in the fourth place, you have to start from the beginning or end of the list (depending on whether it's a singly or doubly linked list) and go from one node to the next until you find what you're looking for.
So how the heck can going step by step be faster than going directly to an element?
This question title is misleading.
It asserts that linked lists are faster than arrays without limiting the scope well. There are a number of times when arrays can be significantly faster and there are a number of times when a linked list can be significantly faster: the particular case of linked lists "being faster" does not appear to be supported.
There are two things to consider:
The theoretical bounds of linked-lists vs. arrays in a particular operation; and
the real-world implementation and usage pattern including cache-locality and allocations.
As far as the access of an indexed element: The operation is O(1) in an array and, as pointed out, is very fast (just an offset). The operation is O(k) in a linked list (where k is the index and may always be << n, depending) but if the linked list is already being traversed then this is O(1) per step, which is "the same" as an array. Whether an array traversal (for(i=0;i<len;i++)) is faster (or slower) depends upon the particular implementation/language/run-time.
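The difference between the two kinds of indexed access can be sketched as follows. The packed byte buffer imitates a C-style array of ten 8-byte integers (the example from the question), and the Node class is a hypothetical minimal singly linked list; all names here are made up for the illustration.

```python
import struct

ITEM_FORMAT = "<q"                          # one 8-byte signed integer
ITEM_SIZE = struct.calcsize(ITEM_FORMAT)    # 8

# Ten integers packed back to back, like a C array of int64.
values = list(range(100, 110))
packed = struct.pack("<10q", *values)

def array_get(buf, index):
    """Indexed access: one multiplication, then read ITEM_SIZE bytes."""
    return struct.unpack_from(ITEM_FORMAT, buf, index * ITEM_SIZE)[0]

class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

def list_get(head, index):
    """Indexed access in a singly linked list: follow `index` links."""
    node = head
    for _ in range(index):
        node = node.next
    return node.value

head = None
for v in reversed(values):
    head = Node(v, head)

fourth_from_array = array_get(packed, 3)   # reads bytes 24..31
fourth_from_list = list_get(head, 3)       # walks three links first
print(fourth_from_array, fourth_from_list)
```

Both calls return the same value; the array version does a constant amount of work regardless of the index, while the list version does work proportional to it.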
However, if there is a specific case where the array is not faster for either of the above operations (seek or traversal), it would be interesting to see it dissected in more detail. (I am sure it is possible to find a language with a very degenerate implementation of arrays over lists *cough* Haskell *cough*.)
Happy coding.
My simple usage summary: Arrays are good for indexed access and operations which involve swapping elements. The non-amortized re-size operation and extra slack (if required), however, may be rather costly. Linked lists amortize the re-sizing (and trade slack for a "pointer" per-cell) and can often excel at operations like "chopping out or inserting a bunch of elements". In the end they are different data-structures and should be treated as such.
Like most problems in programming, context is everything. You need to think about the expected access patterns of your data, and then design your storage system appropriately. If you insert something once, and then access it 1,000,000 times, then who cares what the insert cost is? On the other hand, if you insert/delete as often as you read, then those costs drive the decision.
Depends on which operation you are referring to. Adding or removing elements is a lot faster in a linked list than in an array.
Iterating sequentially over the list one by one is more or less the same speed in a linked list and an array.
Getting one specific element in the middle is a lot faster in an array.
And the array might waste space, because very often when expanding the array, more elements are allocated than needed at that point in time (think ArrayList in Java).
So you need to choose your data structure depending on what you want to do:
many insertions and iterating sequentially --> use a LinkedList
random access and ideally a predefined size --> use an array
Because no memory is moved when an insertion is made in the middle of a linked list.
For the case you presented, it's true: arrays are faster, since you need only arithmetic to go from one element to another. Linked lists require indirection and fragment memory.
The key is to know what structure to use and when.
Linked lists are preferable over arrays when:
a) you need constant-time insertions/deletions from the list (such as in real-time computing where time predictability is absolutely critical)
b) you don't know how many items will be in the list. With arrays, you may need to re-declare and copy memory if the array grows too big
c) you don't need random access to any elements
d) you want to be able to insert items in the middle of the list (such as a priority queue)
Arrays are preferable when:
a) you need indexed/random access to elements
b) you know the number of elements in the array ahead of time so that you can allocate the correct amount of memory for the array
c) you need speed when iterating through all the elements in sequence. You can use pointer math on the array to access each element, whereas in a linked list you need to look up each node through the previous node's pointer, which may cause page faults and, in turn, performance hits.
d) memory is a concern. Filled arrays take up less memory than linked lists. Each element in the array is just the data. Each linked list node requires the data as well as one (or more) pointers to the other elements in the linked list.
Array Lists (like those in .Net) give you the benefits of arrays, but dynamically allocate resources for you so that you don't need to worry too much about list size and you can delete items at any index without any effort or re-shuffling elements around. Performance-wise, arraylists are slower than raw arrays.
Reference:
Lamar answer
https://stackoverflow.com/a/393578/6249148
LinkedList is Node-based meaning that data is randomly placed in memory and is linked together by nodes (objects that point to another, rather than being next to one another)
Array is a set of similar data objects stored in sequential memory locations
The advantage of a linked list is that data doesn’t have to be sequential in memory. When you add/remove an element, you are simply changing the pointer of a node to point to a different node, not actually moving elements around. If you don’t have to add elements towards the end of the list, then accessing data is faster, due to iterating over fewer elements. There are also variations of the LinkedList, such as a DoublyLinkedList, whose nodes point to both the previous and the next node.
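The pointer changes described here can be sketched with a hypothetical minimal singly linked list; the Node class and the helper names are assumptions made for this illustration.

```python
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

def insert_after(node, value):
    """Link a new node right after `node`; no other element is touched."""
    node.next = Node(value, node.next)
    return node.next

def remove_after(node):
    """Unlink the node that follows `node`, if there is one."""
    if node.next is not None:
        node.next = node.next.next

def to_list(head):
    """Collect the values by walking the chain from head to tail."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out

head = Node(1, Node(2, Node(4)))
insert_after(head.next, 3)      # 1 -> 2 -> 3 -> 4
after_insert = to_list(head)
remove_after(head)              # drops the node holding 2
after_remove = to_list(head)
print(after_insert, after_remove)
```

Each operation rewires a single pointer once the neighbouring node is in hand; finding that node in the first place is still a traversal.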
The advantage of an array is that yes you can access any element O(1) time if you know the index, but if you don’t know the index, then you will have to iterate over the data.
The downside of an array is the fact that its data is stored sequentially in memory. If you want to insert an element at index 1, then you have to move every element after it one place to the right. Also, the array has to keep resizing itself as it grows, basically copying itself in order to make a new array with a larger capacity. If you want to remove an element at the beginning, then you will have to move all the remaining elements to the left.
Arrays are good when you know the index, but are costly as they grow.
The reason why people talk highly about linked lists is because the most useful and efficient data structures are node based.

How can I efficiently copy 2-dimensional arrays of bytes into a larger 2D array?

I have a structure called Patch that represents a 2D array of data.
type Size = (Int, Int)
data Patch = Patch Size Strict.ByteString
I want to construct a larger Patch from a set of smaller Patches and their assigned positions. (The Patches do not overlap.) The function looks like this:
type Position = (Int, Int)
combinePatches :: [(Position, Patch)] -> Patch
combinePatches plan = undefined
I see two sub-problems. First, I must define a function to translate 2D array copies into a set of 1D array copies. Second, I must construct the final Patch from all those copies.
Note that the final Patch will be around 4 MB of data. This is why I want to avoid a naive approach.
I'm fairly confident that I could do this horribly inefficiently, but I would like some advice on how to efficiently manipulate large 2D arrays in Haskell. I have been looking at the "vector" library, but I have never used it before.
Thanks for your time.
If the spec is really just a one-time creation of a new Patch from a set of previous ones and their positions, then this is a straightforward single-pass algorithm. Conceptually, I'd think of it as two steps -- first, combine the existing patches into a data structure with reasonable lookup for any given position. Next, write your new structure lazily by querying the compound structure. This should be roughly O(n log(m)) -- n being the size of the new array you're writing, and m being the number of patches.
This is conceptually much simpler if you use the Vector library instead of a raw ByteString. But it is simpler still if you simply use Data.Array.Unboxed. If you need arrays that can interop with C, then use Data.Array.Storable instead.
If you ditch purity, at least locally, and work with an ST array, you should be able to trivially do this in O(n) time. Of course, the constant factors will still be worse than using fast copying of chunks of memory at a time, but there's no way to keep that code from looking low-level.
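For illustration only, here is how the mutable-buffer idea looks when each 2D copy is broken into one 1D copy per patch row. The sketch is in Python rather than Haskell, with a bytearray playing the role of the ST array. The function name combine_patches, the (x, y) position convention and the (width, height) size convention are assumptions, not taken from the question, and patches are assumed to fit inside the result without overlapping.

```python
def combine_patches(size, plan):
    """Build one flat row-major buffer from non-overlapping patches.

    size: (width, height) of the result.
    plan: iterable of ((x, y), (patch_width, patch_height), patch_bytes).
    """
    width, height = size
    out = bytearray(width * height)
    for (x, y), (pw, ph), data in plan:
        for row in range(ph):
            src = row * pw                    # start of this row in the patch
            dst = (y + row) * width + x       # start of this row in the result
            out[dst:dst + pw] = data[src:src + pw]   # one 1D copy per patch row
    return bytes(out)

plan = [
    ((0, 0), (2, 2), b"AAAA"),
    ((2, 0), (1, 2), b"BB"),
    ((0, 2), (3, 1), b"CCC"),
]
combined = combine_patches((3, 3), plan)
print(combined)
```

Every output byte is written exactly once, so the work is linear in the size of the final Patch.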

Resources