Python 2.7: looping over 1D fibers in a multidimensional NumPy array

I am looking for a way to loop over 1D fibers (row, column, and multi-dimensional equivalents) along any dimension in a 3+-dimensional array.
In a 2D array this is fairly trivial, since the fibers are rows and columns: just saying for row in A gets the job done. But for 3D arrays, for example, this expression iterates over 2D slices, not 1D fibers.
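A quick check of that behavior (a minimal illustration, using the same A as below):
import numpy as np

A = np.arange(27).reshape((3, 3, 3))
for s in A:
    print s.shape  # (3, 3) -- each item is a 2D slice, not a 1D fiber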
A working solution is the one below:
import numpy as np
A = np.arange(27).reshape((3,3,3))
func = np.sum
for fiber_index in np.ndindex(A.shape[:-1]):
    print func(A[fiber_index])
However, I am wondering whether there is something that is:
- More idiomatic
- Faster
Hope you can help!

I think you might be looking for numpy.apply_along_axis
In [10]: def my_func(x):
    ...:     return x**2 + x
In [11]: np.apply_along_axis(my_func, 2, A)
Out[11]:
array([[[ 0, 2, 6],
[ 12, 20, 30],
[ 42, 56, 72]],
[[ 90, 110, 132],
[156, 182, 210],
[240, 272, 306]],
[[342, 380, 420],
[462, 506, 552],
[600, 650, 702]]])
Although many NumPy functions (including sum) have their own axis argument to specify which axis to use:
In [12]: np.sum(A, axis=2)
Out[12]:
array([[ 3, 12, 21],
[30, 39, 48],
[57, 66, 75]])

numpy provides a number of different ways of looping over 1 or more dimensions.
Your example:
func = np.sum
for fiber_index in np.ndindex(A.shape[:-1]):
    print fiber_index
    print A[fiber_index]
produces something like:
(0, 0)
[0 1 2]
(0, 1)
[3 4 5]
(0, 2)
[6 7 8]
...
This generates all index combinations over the first two dimensions, giving your function the 1D fiber on the last.
Look at the code for ndindex; it's instructive. I tried to extract its essence in https://stackoverflow.com/a/25097271/901925.
It uses as_strided to generate a dummy array over which an nditer iterates. It uses the 'multi_index' mode to generate an index set, rather than the elements of that dummy. The iteration itself is done with a __next__ method. This is the same style of indexing that is currently used in numpy compiled code.
http://docs.scipy.org/doc/numpy-dev/reference/arrays.nditer.html
Iterating Over Arrays has a good explanation, including an example of doing so in Cython.
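A rough sketch of that essence (not the actual ndindex source, which uses as_strided so the dummy array costs nothing; I use a plain zeros array here):
import numpy as np

# nditer in 'multi_index' mode yields index tuples;
# the element values of the dummy array are ignored
shape = (3, 3)
it = np.nditer(np.zeros(shape), flags=['multi_index'])
while not it.finished:
    print it.multi_index  # (0, 0), (0, 1), ... (2, 2)
    it.iternext()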
Many functions, among them sum, max, product, let you specify which axis (axes) you want to iterate over. Your example, with sum, can be written as:
np.sum(A, axis=-1)
np.sum(A, axis=(1,2)) # sum over 2 axes
An equivalent is
np.add.reduce(A, axis=-1)
np.add is a ufunc, and reduce specifies an iteration mode. There are many other ufuncs, and other iteration modes such as accumulate and reduceat. You can also define your own ufunc.
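For instance (with the same A as above):
import numpy as np

A = np.arange(27).reshape((3, 3, 3))
np.add.reduce(A, axis=-1)              # same as np.sum(A, axis=-1)
np.add.accumulate([1, 2, 3])           # running sum: array([1, 3, 6])
np.add.reduceat(np.arange(8), [0, 4])  # sums of [0:4] and [4:8]: array([ 6, 22])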
xnx suggests
np.apply_along_axis(np.sum, 2, A)
It's worth digging through apply_along_axis to see how it steps through the dimensions of A. In your example, it steps over all possible i,j in a while loop, calculating:
outarr[(i,j)] = np.sum(A[(i, j, slice(None))])
Including slice objects in the indexing tuple is a nice trick. Note that it edits a list and then converts it to a tuple for indexing; the list is used for the editing because tuples are immutable.
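The trick itself, in a couple of lines (this mirrors what apply_along_axis does internally, not its exact code):
import numpy as np

A = np.arange(27).reshape((3, 3, 3))
ind = [0, 0, 0]        # a mutable list of indices
ind[-1] = slice(None)  # replace the last index with ':'
print A[tuple(ind)]    # the fiber A[0, 0, :] -> [0 1 2]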
Your iteration can be applied along any axis by rolling that axis to the end. This is a 'cheap' operation, since it just changes the strides.
def with_ndindex(A, func, ax=-1):
    # apply func along axis ax
    A = np.rollaxis(A, ax, A.ndim)  # roll ax to end (changes strides)
    shape = A.shape[:-1]
    B = np.empty(shape, dtype=A.dtype)
    for ii in np.ndindex(shape):
        B[ii] = func(A[ii])
    return B
I did some timings on 3x3x3, 10x10x10 and 100x100x100 A arrays. This np.ndindex approach is consistently a third faster than the apply_along_axis approach. Direct use of np.sum(A, -1) is much faster.
So if func is limited to operating on a 1D fiber (unlike sum), then the ndindex approach is a good choice.

Related

How to get a sub-shape of an array in Python?

Not sure the title is correct, but I have an array with shape (84,84,3) and I need to get a subset of this array with shape (84,84), excluding the third dimension.
How can I accomplish this with Python?
your_array[:,:,0]
This is called slicing. This particular example gets the first 'layer' of the array. This assumes your subshape is a single layer.
If you are using numpy arrays, using slices would be a standard way of doing it:
import numpy as np
n = 3 # or any other positive integer
a = np.empty((84, 84, n))
i = 0  # any i in range(n)
b = a[:, :, i]
print(b.shape)  # (84, 84)
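Note that basic slicing like this returns a view rather than a copy, so writing to b also changes a:
import numpy as np

a = np.zeros((84, 84, 3))
b = a[:, :, 0]     # basic slicing -> a view into a
b[0, 0] = 1.0
print(a[0, 0, 0])  # 1.0 -- the original array was modified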

Sort a Julia 1.1 matrix by one of its columns, that contains strings

As the title suggests, I need to sort the rows of a certain matrix by one of its columns, preferably in place if at all possible. Said column contains strings (the array being of type Array{Union{Float64,String}}), and ideally the rows should end up in alphabetical order, determined by this column. The line
sorted_rows = sort!(data, by = i -> data[i,2]),
where data is my matrix, produces the error ERROR: LoadError: UndefKeywordError: keyword argument dims not assigned. Specifying which part of the matrix I want sorted and adding the parameter dims=2 (which I assume is the dimension I want to sort along), namely
sorted_rows = sort!(data[2:end-1,:], by = i -> data[i,2],dims=2)
simply changes the error message to ERROR: LoadError: ArgumentError: invalid index: 01 Suurin yhteinen tekijä ja pienin yhteinen jaettava of type String. So the compiler is complaining about a string being an invalid index.
Any ideas on how this type of sorting could be done? I should say that in this case the strings in the column can be expected to start with a number, but I wouldn't mind finding a solution that works in the general case.
I'm using Julia 1.1.
You want sortslices, not sort — the latter just sorts all columns independently, whereas the former rearranges whole slices. Secondly, the by function doesn't take an index, it takes the value that is about to be compared (and allows you to transform it in some way). Thus:
julia> using Random
julia> data = Union{Float64, String}[randn(100) [randstring(10) for _ in 1:100]]
100×2 Array{Union{Float64, String},2}:
0.211015 "6VPQbWU5f9"
-0.292298 "HgvHLkufqI"
1.74231 "zTCu1U5Vdl"
0.195822 "O3j43sbhKV"
⋮
-0.369007 "VzFH2OpWfU"
-1.30459 "6C68G64AWg"
-1.02434 "rldaQ3e0GE"
1.61653 "vjvn1SX3FW"
julia> sortslices(data, by=x->x[2], dims=1)
100×2 Array{Union{Float64, String},2}:
0.229143 "0syMQ7AFgQ"
-0.642065 "0wUew61bI5"
1.16888 "12PUn4V4gL"
-0.266574 "1Z2ONSBP04"
⋮
1.85761 "y2DDANcFCe"
1.53337 "yZju1uQqMM"
1.74231 "zTCu1U5Vdl"
0.974607 "zdiU0sVOZt"
Unfortunately we don't have an in-place sortslices! yet, but you can easily construct a sorted view with sortperm. This probably won't be as fast to use, but if you need the in-place-ness for semantic reasons it'll do just the trick.
julia> p = sortperm(data[:,2]);
julia> @view data[p, :]
100×2 view(::Array{Union{Float64, String},2}, [26, 45, 90, 87, 6, 96, 82, 75, 12, 27 … 53, 69, 100, 93, 36, 37, 39, 8, 3, 61], :) with eltype Union{Float64, String}:
0.229143 "0syMQ7AFgQ"
-0.642065 "0wUew61bI5"
1.16888 "12PUn4V4gL"
-0.266574 "1Z2ONSBP04"
⋮
1.85761 "y2DDANcFCe"
1.53337 "yZju1uQqMM"
1.74231 "zTCu1U5Vdl"
0.974607 "zdiU0sVOZt"
(If you want the in-place-ness for performance reasons, I'd recommend using a DataFrame or similar structure that holds its columns as independent homogeneous vectors — a Union{Float64, String} will be slower than two separate well-typed vectors, and sort!ing a DataFrame works on whole rows like you want.)
You may want to look at SortingLab.jl's fast string sort functions.
]add SortingLab
using SortingLab
idx = fsortperm(data[:,2])
new_data = data[idx, :]

Find edge points of numpy array for kmeans centroids initialization

I am working on implementing a kmeans algorithm in python.
I am testing out new ways of initializing my centroids and wanted to implement this and see what effect it would have on the clusters.
My idea is to select data points from my data set in such a way that the centroids are initialized to edge points of my data.
A simple two-attribute example:
Let's say this is my input array:
input = np.array([[3,3], [1,1], [-1,-1], [3,-3], [-1,1], [-3,3], [1,-1], [-3,-3]])
From this array I would like to select the edge points, which would be [3,3], [-3,-3], [-3,3], [3,-3]. So if my k is 4, these points would be selected.
The data sets I am working with have 4 and 9 attributes, with around 300 data points each.
Note: I have not found a solution for when k differs from the number of edge points, but if k is greater than the number of edge points, I think I would select these 4 points and then try to place the rest of them around the center point of the graph.
I have also thought about finding the max and min for each column and from there trying to find the edges of my data set, but I don't have an effective way of identifying the edges from those values.
If you believe this idea will not work I would love to hear what you have to say.
Questions
Does numpy have such a function to get the indexes of data points on the edge of my data set?
If not, how would I go at finding these edge points in my data set?
Use scipy and pairwise distances to find how far each point is from every other:
from scipy.spatial.distance import pdist, squareform
p=pdist(input)
Then, use squareform to get the p vector into a matrix shape:
s=squareform(p)
Then, use numpy argwhere to find the indices where the values equal the maximum, and look up those indices in the input array:
input[np.argwhere(s==np.max(p))]
array([[[ 3, 3],
[-3, -3]],
[[ 3, -3],
[-3, 3]],
[[-3, 3],
[ 3, -3]],
[[-3, -3],
[ 3, 3]]])
Complete code would be:
from scipy.spatial.distance import pdist, squareform
p=pdist(input)
s=squareform(p)
input[np.argwhere(s==np.max(p))]
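Since s is symmetric, every extreme pair shows up twice in the argwhere output (once in each order). If a single farthest-apart pair is enough, a minimal sketch using argmax on the distance matrix avoids the duplicates (X here stands for the input array above):
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[3, 3], [1, 1], [-1, -1], [3, -3],
              [-1, 1], [-3, 3], [1, -1], [-3, -3]])
s = squareform(pdist(X))
i, j = np.unravel_index(np.argmax(s), s.shape)  # indices of one farthest-apart pair
print(X[i], X[j])  # e.g. [3 3] [-3 -3]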

Looping through slices of Theano tensor

I have two 2D Theano tensors, call them x_1 and x_2, and suppose for the sake of example, both x_1 and x_2 have shape (1, 50). Now, to compute their mean squared error, I simply run:
T.sqr(x_1 - x_2).mean(axis = -1).
However, what I wanted to do was construct a new tensor that consists of their mean squared error in chunks of 10. In other words, since I'm more familiar with NumPy, what I had in mind was to create the following tensor M in Theano:
M = [theano.tensor.sqr(x_1[:, i:i+10] - x_2[:, i:i+10]).mean(axis = -1) for i in xrange(0, 50, 10)]
Now, since Theano doesn't have for loops, but instead uses scan (of which map is a special case), I thought I would try the following:
sequence = T.arange(0, 50, 10)
M = theano.map(lambda i: theano.tensor.sqr(x_1[:, i:i+10] - x_2[:, i:i+10]).mean(axis = -1), sequence)
However, this does not seem to work, as I get the error:
only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
Is there a way to loop through the slices using theano.scan (or map)? Thanks in advance, as I'm new to Theano!
Similar to what can be done in numpy, a solution would be to reshape your (1, 50) tensor to a (1, 5, 10) tensor (or even a (5, 10) tensor), and then compute the mean along the last axis.
To illustrate this with numpy, suppose I want to compute means by slices of 2:
x = np.array([0, 2, 0, 4, 0, 6])
x = x.reshape([3, 2])
np.mean(x, axis=1)
outputs
array([ 1., 2., 3.])
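Mapped back to the question's setup, a NumPy sketch of the chunked mean squared error would look like this (Theano tensor variables expose the same reshape and mean methods, so the expression should carry over):
import numpy as np

x_1 = np.random.rand(1, 50)
x_2 = np.random.rand(1, 50)
# 5 chunks of 10: reshape the squared error to (1, 5, 10),
# then average over the last axis
M = ((x_1 - x_2) ** 2).reshape(1, 5, 10).mean(axis=-1)
print(M.shape)  # (1, 5)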

How to vectorize NumPy polyder function?

I would like to vectorize the NumPy function polyder, which computes derivatives of polynomials. Is there a simple way or a built-in function to do it?
With vectorize, I mean that if the input is an array of polynomials, the output would be the array with the derivative of the polynomials.
An example:
p = np.array([[3,4,5], [1,2]])
the output should be something like
np.array([[6, 4], [1]])
Since your subarrays, both input and output, can have different lengths, you are better off treating both as lists.
In [97]: [np.polyder(d) for d in [[3,4,5],[1,2]]]
Out[97]: [array([6, 4]), array([1])]
Your p is just a list in an expensive (timewise) array wrapper.
In [100]: p=np.array([[3,4,5],[1,2]])
In [101]: p
Out[101]: array([[3, 4, 5], [1, 2]], dtype=object)
There is little that you can do with such an array that you can't do just as well with a list. Do some time tests. You will probably find that iterating over an array of objects is slower than iterating over the equivalent list, especially if you take into account the time it takes to convert a list to an array.
It can also be tricky to create such arrays. If all the sublists are the same length the result will be a 2d array. Forcing them to be an object array takes special initialization.
A general rule of thumb is: if the individual steps work with arrays or lists of different lengths, you probably can't vectorize. That is, you can't form a rectangular 2d array and apply vector operations.
If the polynomial lists were all the same length, then p could be 2d, and the result could also be that:
In [107]: p=np.array([[3,4,5],[0,1,2]])
In [108]: p
Out[108]:
array([[3, 4, 5],
[0, 1, 2]])
In [109]: np.array([np.polyder(i) for i in p])
Out[109]:
array([[6, 4],
[0, 1]])
In effect it is iterating over the rows of p, and then reassembling the result into an array. There are some numpy functions that streamline iteration (but don't speed it up much), but I see little need for those here.
Looking at the code of this function, the core is:
p = NX.asarray(p)
n = len(p) - 1
y = p[:-1] * NX.arange(n, 0, -1)
which for this 2d array (rows of length 3) is:
In [117]: p[:,:-1]*np.arange(2,0,-1)
Out[117]:
array([[6, 4],
[0, 1]])
So if the polynomials are all the same length, this simple multiplication gives the first-order derivative coefficients. And of course the rows can be padded so they are all the same length. So 'vectorization' is easier than I initially thought.
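A minimal sketch of that padded version (my own illustration, not numpy's code): zero-pad each coefficient list on the left, since coefficients are stored highest degree first, then apply the same row-wise multiplication:
import numpy as np

polys = [[3, 4, 5], [1, 2]]
L = max(len(q) for q in polys)
# left-pad with zeros so every row has the same length (highest degree first)
P = np.array([[0] * (L - len(q)) + list(q) for q in polys])
D = P[:, :-1] * np.arange(L - 1, 0, -1)  # first-derivative coefficients, row-wise
print(D)  # [[6 4]
          #  [0 1]]  -- [0, 1] is np.polyder([1, 2]) with a leading zero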
import numpy as np
p = np.array([[3,4,5], [1,2]])
np.array([np.polyder(coefficients) for coefficients in p])  # array([array([6, 4]), array([1])], dtype=object)
would fulfill your interface for your specific example. But as hpaulj mentions, there's little sense in working with NumPy arrays instead of normal python lists here, and no actual (hardware-level) vectorization will happen. (Though, as with list comprehensions in general, the interpreter would be free to employ other means of parallelism to compute them.)
