Haskell - Reproduce numpy's reshape - arrays

Getting into Haskell, I'm trying to reproduce something like numpy's reshape with lists. Specifically, given a flat list, reshape it into an n-dimensional list:
import numpy as np
a = np.arange(1, 18)
b = a.reshape([-1, 2, 3])
# b =
#
# array([[[ 1,  2,  3],
#         [ 4,  5,  6]],
#
#        [[ 7,  8,  9],
#         [10, 11, 12]],
#
#        [[13, 14, 15],
#         [16, 17, 18]]])
I was able to reproduce the behaviour with fixed indices, e.g.:
*Main> reshape23 [1..18]
[[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]],[[13,14,15],[16,17,18]]]
My code is:
takeWithRemainder :: (Integral n) => n -> [a] -> ([a], [a])
takeWithRemainder _ [] = ([], [])
takeWithRemainder 0 xs = ([], xs)
takeWithRemainder n (x:xs) = (x : taken, remaining)
    where (taken, remaining) = takeWithRemainder (n-1) xs

chunks :: (Integral n) => n -> [a] -> [[a]]
chunks _ [] = []
chunks chunkSize xs = chunk : chunks chunkSize remainderOfList
    where (chunk, remainderOfList) = takeWithRemainder chunkSize xs

reshape23 = chunks 2 . chunks 3
Now, I can't seem to find a way to generalise this to an arbitrary shape. My original idea was doing a fold:
reshape :: (Integral n) => [n] -> [a] -> [b]
reshape ns list = foldr (\n acc -> (chunks n) . acc) id ns list
But no matter how I go about it, I always get a type error from the compiler. From my understanding, the problem is that at some point the type of acc gets inferred to be id's, i.e. a -> a, and the compiler doesn't like that the functions being composed in the fold all have different (although composable) types. I run into the same problem when I implement this with explicit recursion instead of a fold.
This confused me because originally I had intended for the [b] in reshape's type signature to be a stand-in for "another, dissociated type" that could be anything from [[a]] to [[[[[a]]]]].
How am I going wrong about this? Is there a way to actually achieve the behaviour I intended, or is it just plain wrong to want this kind of "dynamic" behaviour in the first place?

There are two details here that are qualitatively different from Python, ultimately stemming from dynamic vs. static typing.
The first one you have noticed yourself: at each chunking step the resulting type is different from the input type. This means you cannot use foldr, because it expects a function of one specific type. You could do it via recursion though.
The second problem is a bit less obvious: the return type of your reshape function depends on the value of its first argument. If the first argument is [2], the return type is [[a]], but if it is [2, 3], the return type is [[[a]]]. In Haskell, all types must be known at compile time, so reshape cannot take a first argument whose value is only determined at runtime. In other words, the first argument must live at the type level.
Type-level values may be computed via type functions (aka "type families"), but because it's not just the type (i.e. you also have a value to compute), the natural (or the only?) mechanism for that is a type class.
So, first let's define our type class:
class Reshape (dimensions :: [Nat]) from to | dimensions from -> to where
    reshape :: from -> to
The class has three parameters: dimensions of kind [Nat] is a type-level array of numbers, representing the desired dimensions. from is the argument type, and to is the result type. Note that, even though it is known that the argument type is always [a], we have to have it as a type variable here, because otherwise our class instances won't be able to correctly match the same a between argument and result.
Plus, the class has a functional dependency dimensions from -> to to indicate that if I know both dimensions and from, I can unambiguously determine to.
Next, the base case: when dimensions is an empty list, the function just degrades to id:
instance Reshape '[] [a] [a] where
    reshape = id
And now the meat: the recursive case.
instance (KnownNat n, Reshape tail [a] [b]) => Reshape (n:tail) [a] [[b]] where
    reshape = chunksOf n . reshape @tail
        where n = fromInteger . natVal $ Proxy @n
First it makes the recursive call reshape @tail to chunk out the remaining dimensions, and then it chunks the result of that using the value of the current dimension as the chunk size.
Note also that I'm using the chunksOf function from the library split. No need to redefine it yourself.
Let's test it out:
λ reshape @'[1] [1,2,3]
[[1],[2],[3]]
λ reshape @'[1,2] [1,2,3,4]
[[[1,2]],[[3,4]]]
λ reshape @'[2,3] [1..12]
[[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]]
λ reshape @'[2,3,4] [1..24]
[[[[1,2,3,4],[5,6,7,8],[9,10,11,12]],[[13,14,15,16],[17,18,19,20],[21,22,23,24]]]]
For reference, here's the full program with all imports and extensions:
{-# LANGUAGE
    MultiParamTypeClasses, FunctionalDependencies, TypeApplications,
    ScopedTypeVariables, DataKinds, TypeOperators, KindSignatures,
    FlexibleInstances, FlexibleContexts, UndecidableInstances,
    AllowAmbiguousTypes
#-}
import Data.Proxy (Proxy(..))
import Data.List.Split (chunksOf)
import GHC.TypeLits (Nat, KnownNat, natVal)

class Reshape (dimensions :: [Nat]) from to | dimensions from -> to where
    reshape :: from -> to

instance Reshape '[] [a] [a] where
    reshape = id

instance (KnownNat n, Reshape tail [a] [b]) => Reshape (n:tail) [a] [[b]] where
    reshape = chunksOf n . reshape @tail
        where n = fromInteger . natVal $ Proxy @n

@Fyodor Soikin's answer is perfect with respect to the actual question. Except there is a bit of a problem with the question itself: a list of lists is not the same thing as an array. It is a common misconception that Haskell doesn't have arrays and you are forced to deal with lists, which could not be further from the truth.
Because the question is tagged with array and there is a comparison to numpy, I would like to add a proper answer that handles this situation for multidimensional arrays. There are a couple of array libraries in the Haskell ecosystem, one of which is massiv.
A reshape-like functionality from numpy can be achieved with the resize' function:
λ> 1 ... (18 :: Int)
Array D Seq (Sz1 18)
  [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 ]
λ> resize' (Sz (3 :> 2 :. 3)) (1 ... (18 :: Int))
Array D Seq (Sz (3 :> 2 :. 3))
  [ [ [ 1, 2, 3 ]
    , [ 4, 5, 6 ]
    ]
  , [ [ 7, 8, 9 ]
    , [ 10, 11, 12 ]
    ]
  , [ [ 13, 14, 15 ]
    , [ 16, 17, 18 ]
    ]
  ]

Related

How can I create and push to a shared or distributed array of arrays?

I have written Julia code in which I initialize an empty array as follows:
a = []
Later in the code, I simply push to this array as follows:
push!(a, b)
where b = [c, d, e, ...] is another array, and each b can have a different length.
This works just fine in un-parallelized code. However, I want to do the same thing in parallelized code where a = [] is a shared or distributed array that the different processors can push to.
Neither SharedArray nor DArray worked for me. Any advice?
First of all, you should always declare what your array holds: [] means Any[], and that is almost never a good idea.
Let us consider this vector with placeholders:
julia> a=[Int[] for _ in 1:8]
8-element Vector{Vector{Int64}}:
[]
[]
[]
[]
[]
[]
[]
[]
This Vector contains 8 references to other Vectors.
Let us now distribute it:
julia> using Distributed; addprocs(4);
julia> @everywhere using DistributedArrays
julia> b = distribute(a)
8-element DArray{Vector{Int64}, 1, Vector{Vector{Int64}}}:
[]
[]
[]
[]
[]
[]
[]
[]
This new b is now available through all worker processes where each worker holds its localpart of it. Let us mutate it!
julia> fetch(@spawnat 2 append!(localpart(b)[1], [1,2,3,4]));
julia> fetch(@spawnat 3 append!(localpart(b)[2], [10,20]));
julia> fetch(@spawnat 3 push!(localpart(b)[2], 30))
3-element Vector{Int64}:
10
20
30
We can see that everything is working as expected (we have used fetch to make sure our code actually got executed on remote workers).
Let us now check the state of b on the master process:
julia> b
8-element DArray{Vector{Int64}, 1, Vector{Vector{Int64}}}:
[1, 2, 3, 4]
[]
[]
[10, 20, 30]
[]
[]
[]
[]
You can see that we have successfully used remote workers to mutate b.
I asked a similar question here. I originally followed Przemysław's answer but could not get distribute to distribute an already existing array the way I thought it would for Julia 1.7. What worked for me was making the array distributed when it was initialized:
using Distributed
addprocs(4)
@everywhere using DistributedArrays

a = distribute([[] for _ in procs()])

@sync @distributed for i = 1:10
    b = fill(i, 5)
    append!(localpart(a)[1], b) # I swapped push! for append!
end

a
What this does is: first it creates an array with subarrays that are distributed to each worker, then it distributes computation and fills the corresponding subarrays with the values calculated on each worker, finally it merges the subarrays to obtain a full array with all the values.
It is interesting to compare this with the exact same code but with a = distribute([[] for _ in procs()]) replaced by a = [[] for _ in procs()]; distribute(a). Evidently the latter does not work as expected (at least for Julia 1.7).

Python '==' operator gives wrong result

I am comparing two elements of a numpy array. The memory addresses obtained from the id() function for the two elements are different, and the is operator also reports that the two elements are not the same object.
However, if I compare the memory addresses of the two array elements using the == operator, it reports that they are the same.
I am not able to understand how the == operator can return True when the two memory addresses are different.
Below is my code.
import numpy as np
a = np.arange(8)
newarray = a[np.array([3,4,2])]
print("Initial array : ", a)
print("New array : ", newarray)
# comparison of two element using 'is' operator
print("\ncomparison using is operator : ",a[3] is newarray[0])
# comparison of memory address of two element using '==' operator
print("comparison using == opertor : ", id(a[3]) == id(newarray[0]))
# memory address of both elements of array
print("\nMemory address of a : ", id(a[3]))
print("Memory address of newarray : ", id(newarray[0]))
Output:
Initial array : [0 1 2 3 4 5 6 7]
New array : [3 4 2]
comparison using is operator : False
comparison using == operator : True
Memory address of a : 2807046101296
Memory address of newarray : 2808566470576
This is probably due to a combination of Python's integer caching and obscure implementation details of numpy.
If you slightly change the code you will see that the ids are not consistent during the flow of the code, but they are actually the same on each line:
import numpy as np
a = np.arange(8)
newarray = a[np.array([3,4,2])]
print(id(a[3]), id(newarray[0]))
print(id(a[3]), id(newarray[0]))
outputs
276651376 276651376
20168608 20168608
A numpy array does not store references to objects like a list (unless it is object dtype). It has a 1d databuffer with the numeric values, which it may access in various ways.
In [17]: a = np.arange(8)
...: newarray = a[np.array([3,4,2])]
In [18]: a
Out[18]: array([0, 1, 2, 3, 4, 5, 6, 7])
In [21]: newarray
Out[21]: array([3, 4, 2])
newarray, produced with advanced indexing is not a view. It has its own databuffer and values.
Let's 'unbox' elements of these arrays, assigning them to variables.
In [22]: x = a[3]; y = newarray[0]
In [23]: x
Out[23]: 3
In [24]: y
Out[24]: 3
In [25]: id(x),id(y)
Out[25]: (139768142922768, 139768142925584)
The ids are different (assigning to variables prevents the possibly confusing recycling of ids).
Since the ids are different, is is False:
In [26]: x is y
Out[26]: False
but values are the same (by == test)
In [27]: x == y
Out[27]: True
Another 'unboxing', different id:
In [28]: w = a[3]
In [29]: w
Out[29]: 3
In [30]: id(w)
Out[30]: 139768133495504
These integers are actually np.int64 objects. Python does 'cache' small integers, but that does not apply here.
In [33]: type(x)
Out[33]: numpy.int64
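For contrast, here is a quick sketch (my own addition, not from the original answer) of how small-int caching behaves for plain Python ints versus freshly boxed numpy scalars; the small-int cache is a CPython implementation detail:

import numpy as np

x = 3
y = 3
print(x is y)        # True on CPython: small ints are cached/interned

a = np.arange(8)
print(a[3] is a[3])  # False: each indexing operation boxes a fresh np.int64 object
print(a[3] == a[3])  # True: the values compare equal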
We can see "where" the arrays store their data:
In [31]: a.__array_interface__['data']
Out[31]: (33696480, False)
In [32]: newarray.__array_interface__['data']
Out[32]: (33838848, False)
These are totally different buffers. If newarray were a view, the buffer pointers would be the same or nearby.
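For comparison, a basic slice does produce a view sharing the parent's buffer; a small sketch (my own example):

import numpy as np

a = np.arange(8)
v = a[3:6]                        # basic slicing returns a view, not a copy
off_a = a.__array_interface__['data'][0]
off_v = v.__array_interface__['data'][0]
print(off_v - off_a)              # 3 * a.itemsize (24 with 8-byte ints): same buffer
print(v.base is a)                # True: v is backed by a's memory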
If we don't hang on to the indexed object, ids may be reused:
In [34]: id(newarray[0]), id(newarray[0])
Out[34]: (139768133493520, 139768133493520)
In general is and id are not useful when working with numpy arrays.

How to remove all occurrences of an element from NumPy array? [duplicate]

This question already has answers here:
Deleting certain elements from numpy array using conditional checks
(3 answers)
Remove all occurrences of a value from a list?
(26 answers)
Closed 4 years ago.
The title is pretty self-explanatory: I have a numpy array like (let's say ints)
[ 1 2 10 2 12 2 ] and I would like to remove all occurrences of 2, so that the resulting array is [ 1 10 12 ]. Preferably I would like to do this as fast as possible, because I am using relatively large arrays.
NumPy has a function called numpy.delete() but it takes the indexes as an argument, which I do not have.
Edit: The question is indeed different from Deleting certain elements from numpy array using conditional checks, which is I guess a more "general" case. However, the idea of removing occurrences from an array is fundamental enough to merit its own explicit question, so I am keeping the question.
You can use indexing:
arr = np.array([1, 2, 10, 2, 12, 2])
print(arr[arr != 2])
# [ 1 10 12]
Timing is pretty good:
from timeit import Timer
arr = np.array(range(5000))
print(min(Timer(lambda: arr[arr != 4999]).repeat(500, 500)))
# 0.004942436999999522
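For completeness, if you specifically want numpy.delete (as the question mentions), the index argument it needs can be computed first. A small sketch, my own addition rather than part of the original answer:

import numpy as np

arr = np.array([1, 2, 10, 2, 12, 2])
# np.flatnonzero returns the flat indices where the condition holds,
# which is exactly the index argument np.delete expects
idx = np.flatnonzero(arr == 2)
print(np.delete(arr, idx))   # [ 1 10 12]

The boolean-mask version above is still the more direct (and typically faster) approach.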
You can use another numpy function: numpy.setdiff1d(ar1, ar2, assume_unique=False).
This function finds the set difference of two arrays.
import numpy as np
a = np.array([1, 2, 10, 2,12, 2])
b = np.array([2])
c = np.setdiff1d(a,b,True)
print(c)
There are several ways to do this. I suggest you use a mask:
import numpy as np
a = np.array([ 1, 2 ,10, 2, 12, 2 ])
a[~np.isin(a, 2)]
>> array([ 1, 10, 12])
np.isin is convenient because you can apply the filter to multiple elements at once if you need to:
a[~np.isin(a, (1,2))]
>> array([ 10, 12])
Also note that a[mask] uses boolean (advanced) indexing, which always returns a new array rather than a view, so the original array is left untouched; the explicit .copy() below is therefore not strictly required, but it makes the intent to keep both arrays obvious:
b = a[~np.isin(a, (1,2))].copy()
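A quick check (my own addition) that the masked result is an independent array and the original stays intact:

import numpy as np

a = np.array([1, 2, 10, 2, 12, 2])
b = a[~np.isin(a, (1, 2))]   # boolean (advanced) indexing builds a new array
b[0] = 99                    # mutate the result...
print(a)                     # [ 1  2 10  2 12  2]  ...the original is unchanged
print(b)                     # [99 12]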

Julia Approach to python equivalent list of lists

I just started tinkering with Julia and I'm really getting to like it. However, I am running into a road block. For example, in Python (although not very efficient or pythonic), I would create an empty list and append a list of a known size and type, and then convert to a NumPy array:
Python Snippet
a = []
for ....
a.append([1.,2.,3.,4.])
b = numpy.array(a)
I want to be able to do something similar in Julia, but I can't seem to figure it out. This is what I have so far:
Julia snippet
a = Array{Float64}[]
for .....
push!(a,[1.,2.,3.,4.])
end
The result is an n-element Array{Array{Float64,N},1} of size (n,), but I would like it to be an nx4 Array{Float64,2}.
Any suggestions or better way of doing this?
The literal translation of your code would be
# Building up as rows
a = [1. 2. 3. 4.]
for i in 1:3
    a = vcat(a, [1. 2. 3. 4.])
end

# Building up as columns
b = [1.,2.,3.,4.]
for i in 1:3
    b = hcat(b, [1.,2.,3.,4.])
end
But this isn't a natural pattern in Julia; you'd do something like
A = zeros(4,4)
for i in 1:4, j in 1:4
    A[i,j] = j
end
or even
A = Float64[j for i in 1:4, j in 1:4]
Basically allocating all the memory at once.
Does this do what you want?
julia> a = Array{Float64}[]
0-element Array{Array{Float64,N},1}
julia> for i=1:3
push!(a,[1.,2.,3.,4.])
end
julia> a
3-element Array{Array{Float64,N},1}:
[1.0,2.0,3.0,4.0]
[1.0,2.0,3.0,4.0]
[1.0,2.0,3.0,4.0]
julia> b = hcat(a...)'
3x4 Array{Float64,2}:
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
It seems to match the python output:
In [9]: a = []
In [10]: for i in range(3):
a.append([1, 2, 3, 4])
....:
In [11]: b = numpy.array(a); b
Out[11]:
array([[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3, 4]])
I should add that this is probably not what you actually want to be doing as the hcat(a...)' can be expensive if a has many elements. Is there a reason not to use a 2d array from the beginning? Perhaps more context to the question (i.e. the code you are actually trying to write) would help.
The other answers don't work if the number of loop iterations isn't known in advance, or assume that the underlying arrays being merged are one-dimensional. It seems Julia lacks a built-in function for "take this list of N-D arrays and return me a new (N+1)-D array".
Julia requires a different concatenation solution depending on the dimension of the underlying data. So, for example, if the underlying elements of a are vectors, one can use hcat(a...) or cat(a..., dims=2). But if the elements are, e.g., 2D arrays, one must use cat(a..., dims=3), etc. The dims argument to cat is not optional, and there is no default value to indicate "the last dimension".
Here is a helper function that mimics the np.array functionality for this use case. (I called it collapse instead of array, because it doesn't behave quite the same way as np.array)
function collapse(x)
    return cat(x..., dims=length(size(x[1]))+1)
end
One would use this as
a = []
for ...
    ... compute new_a...
    push!(a, new_a)
end
a = collapse(a)

Python 2.7: looping over 1D fibers in a multidimensional Numpy array

I am looking for a way to loop over 1D fibers (row, column, and multi-dimensional equivalents) along any dimension in a 3+-dimensional array.
In a 2D array this is fairly trivial since the fibers are rows and columns, so just saying for row in A gets the job done. But for 3D arrays for example, this expression iterates over 2D slices, not 1D fibers.
A working solution is the one below:
import numpy as np
A = np.arange(27).reshape((3,3,3))
func = np.sum
for fiber_index in np.ndindex(A.shape[:-1]):
    print func(A[fiber_index])
However, I am wondering whether there is something that is:
More idiomatic
Faster
Hope you can help!
I think you might be looking for numpy.apply_along_axis
In [10]: def my_func(x):
...: return x**2 + x
In [11]: np.apply_along_axis(my_func, 2, A)
Out[11]:
array([[[ 0, 2, 6],
[ 12, 20, 30],
[ 42, 56, 72]],
[[ 90, 110, 132],
[156, 182, 210],
[240, 272, 306]],
[[342, 380, 420],
[462, 506, 552],
[600, 650, 702]]])
Although many NumPy functions (including sum) have their own axis argument to specify which axis to use:
In [12]: np.sum(A, axis=2)
Out[12]:
array([[ 3, 12, 21],
[30, 39, 48],
[57, 66, 75]])
numpy provides a number of different ways of looping over 1 or more dimensions.
Your example:
func = np.sum
for fiber_index in np.ndindex(A.shape[:-1]):
    print fiber_index
    print A[fiber_index]
produces something like:
(0, 0)
[0 1 2]
(0, 1)
[3 4 5]
(0, 2)
[6 7 8]
...
ndindex generates all index combinations over the first two dimensions, giving your function the 1D fiber on the last.
Look at the code for ndindex. It's instructive. I tried to extract its essence in https://stackoverflow.com/a/25097271/901925.
It uses as_strided to generate a dummy matrix over which an nditer iterates. It uses the 'multi_index' mode to generate an index set, rather than elements of that dummy. The iteration itself is done with a __next__ method. This is the same style of indexing that is currently used in numpy compiled code.
http://docs.scipy.org/doc/numpy-dev/reference/arrays.nditer.html
Iterating Over Arrays has a good explanation, including an example of doing so in Cython.
Many functions, among them sum, max, product, let you specify which axis (axes) you want to iterate over. Your example, with sum, can be written as:
np.sum(A, axis=-1)
np.sum(A, axis=(1,2)) # sum over 2 axes
An equivalent is
np.add.reduce(A, axis=-1)
np.add is a ufunc, and reduce specifies an iteration mode. There are many other ufuncs, and other iteration modes - accumulate, reduceat. You can also define your own ufunc.
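As a quick illustration of those iteration modes (my own sketch, not part of the original answer):

import numpy as np

a = np.arange(1, 7)                      # [1 2 3 4 5 6]

print(np.add.reduce(a))                  # 21, same as np.sum(a)
print(np.add.accumulate(a))              # [ 1  3  6 10 15 21], running sum
print(np.add.reduceat(a, [0, 2, 4]))     # [ 3  7 11], sums over segments [0:2], [2:4], [4:]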
xnx suggests
np.apply_along_axis(np.sum, 2, A)
It's worth digging through apply_along_axis to see how it steps through the dimensions of A. In your example, it steps over all possible i,j in a while loop, calculating:
outarr[(i,j)] = np.sum(A[(i, j, slice(None))])
Including slice objects in the indexing tuple is a nice trick. Note that it edits a list, and then converts it to a tuple for indexing. That's because tuples are immutable.
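A tiny sketch of that indexing-tuple trick (my own example):

import numpy as np

A = np.arange(27).reshape(3, 3, 3)

# Build the index as a (mutable) list so one slot can hold a full slice,
# then convert it to a tuple, since that is what ndarray indexing expects.
idx = [1, 2, slice(None)]
print(A[tuple(idx)])     # the 1D fiber A[1, 2, :] -> [15 16 17]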
Your iteration can be applied along any axis by rolling that axis to the end. This is a 'cheap' operation since it just changes the strides.
def with_ndindex(A, func, ax=-1):
    # apply func along axis ax
    A = np.rollaxis(A, ax, A.ndim)  # roll ax to end (changes strides)
    shape = A.shape[:-1]
    B = np.empty(shape, dtype=A.dtype)
    for ii in np.ndindex(shape):
        B[ii] = func(A[ii])
    return B
I did some timings on 3x3x3, 10x10x10 and 100x100x100 A arrays. This np.ndindex approach is consistently a third faster than the apply_along_axis approach. Direct use of np.sum(A, -1) is much faster.
So if func is limited to operating on a 1D fiber (unlike sum), then the ndindex approach is a good choice.
