Accessing properties of objects in a numpy array - arrays

I've got a numpy array of custom objects. How can I get a new array containing the values of specific attributes of those objects?
Example:
import numpy as np
class Pos():
    def __init__(self, x, y):
        self.x = x
        self.y = y
arr = np.array( [ Pos(0,1), Pos(2,3), Pos(4,5) ] )
# Magic line
xy_arr = .... # arr[ [arr.x,arr.y] ]
print(xy_arr)
# array([[0,1],
#        [2,3],
#        [4,5]])
I should add that my motive for such an operation is to calculate the centre of mass of the objects in the array.

Usually, when I have multiple quantities that belong together and I want to benefit from NumPy's indexing power, I use record arrays. Beware: if you do a lot of append/remove operations, NumPy can be rather inefficient in terms of speed.
If I understood your comment correctly, this is an example where two values are selected by a third:
import numpy as np
# create a table for your data
dt = np.dtype([('A', np.double), ('x', np.double), ('y', np.double)])
table = np.array([(1,1,1), (2,2,2), (3,3,3)], dtype=dt)
# define a selection mask
selection = table['A'] > 1.5
columns = ['x', 'y']
print(table[selection][columns])
A nice side effect is that saving this table using h5py is very simple and convenient as your data is already labeled.
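If the objects have to stay as they are, the attribute values can also be collected with a plain list comprehension. The following is only a minimal sketch: it reuses the Pos class from the question and assumes every object has the same mass, so the centre of mass reduces to the mean of the coordinates.

```python
import numpy as np

class Pos:
    def __init__(self, x, y):
        self.x = x
        self.y = y

arr = np.array([Pos(0, 1), Pos(2, 3), Pos(4, 5)])

# Attribute access on an object array is not vectorised, so the values
# are pulled out in an ordinary Python loop and packed into a new array.
xy_arr = np.array([(p.x, p.y) for p in arr])

# Assumption: equal masses, so the centre of mass is the coordinate mean.
centre = xy_arr.mean(axis=0)

print(xy_arr)
print(centre)
```

With unequal masses, np.average(xy_arr, axis=0, weights=masses) would be the analogous call, where masses is a hypothetical 1-D array with one entry per object.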

Related

Numpy arrays inside a pandas dataframe - how to normalize the values, keeping the original structure?

I have arrays as cells in a dataframe. Each array has 2 columns, a value and a category, and their lengths (the number of rows) differ.
Here's a simple example of the situation with just one column:
import pandas as pd
import numpy as np
arr1 = np.array([[1, 2,3], ['a','b','c']])
arr2 = np.array([[2, 3], ['a','b']])
df1 = pd.DataFrame(index=np.arange(0, 2), columns=(['column1']))
df1.iloc[0][0]=arr1
df1.iloc[1][0]=arr2
The resulting df1 is:
0 [[1, 2, 3], [a, b, c]]
1 [[2, 3], [a, b]]
What I want are column-wise normalized values as new columns inside arrays arr1 and arr2, so in this case normalizing over [1,2,3,2,3], not over [1,2,3] and [2,3] separately. How can I achieve this? The structure of the dataframe df1 must not change, only what is inside the cells.
Extracting the values to a list and then normalizing them is an easy task, but how to "put them back" is where I struggle because of the complex structure. Should I add an index to all the values inside the arrays to pair them up? That sounds slow and unnecessary. Could I somehow create an array of references to the original numeric objects and replace those? But if I do, I would lose the original values... and how would I add them as a new column if I am only referencing the original objects?
I am sure there is an intuitive way of doing this but I just can't articulate it.
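One possible approach, sketched here in plain NumPy on a list that stands in for the cells of df1['column1']: pool the numeric values of all cells to get column-wide statistics, then rebuild each cell with one extra row. Min-max scaling and the helper name add_normalized_row are assumptions made for illustration; the question does not say which normalization is wanted.

```python
import numpy as np

arr1 = np.array([[1, 2, 3], ['a', 'b', 'c']])  # mixed input becomes a string array
arr2 = np.array([[2, 3], ['a', 'b']])
cells = [arr1, arr2]  # stand-in for the cell values of the column

# Pool the numeric row of every cell to get column-wide statistics.
pooled = np.concatenate([c[0].astype(float) for c in cells])
lo, hi = pooled.min(), pooled.max()

def add_normalized_row(cell):
    """Return a copy of `cell` with a min-max normalized row appended."""
    values = cell[0].astype(float)
    out = np.empty((cell.shape[0] + 1, cell.shape[1]), dtype=object)
    out[:2] = cell                          # original values and categories
    out[2] = (values - lo) / (hi - lo)      # scaled with the pooled min/max
    return out

new_cells = [add_normalized_row(c) for c in cells]
for cell in new_cells:
    print(cell)
```

Because each cell is rebuilt independently, no extra index is needed to pair values up again: the position inside the cell is preserved, and each entry of new_cells can be written back to the cell it came from.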

How to get a sub-shape of an array in Python?

Not sure the title is correct, but I have an array with shape (84,84,3) and I need to get a subset of this array with shape (84,84), dropping the third dimension.
How can I accomplish this with Python?
your_array[:,:,0]
This is called slicing. This particular example gets the first 'layer' of the array. This assumes your subshape is a single layer.
If you are using numpy arrays, using slices would be a standard way of doing it:
import numpy as np
n = 3 # or any other positive integer
a = np.empty((84, 84, n))
i = 0 # any index with 0 <= i < n
b = a[:, :, i]
print(b.shape)
I recommend you have a look at this.
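One detail worth knowing when slicing this way: a basic slice of a NumPy array is a view on the original data, not a copy. A small sketch, using a zero-filled array as a stand-in for the real (84, 84, 3) data:

```python
import numpy as np

a = np.zeros((84, 84, 3))

b = a[:, :, 0]          # view on the first layer, shape (84, 84)
print(b.shape)

b[0, 0] = 1.0           # writing through the view changes `a` as well
print(a[0, 0, 0])

c = a[:, :, 0].copy()   # an explicit copy is independent of `a`
c[1, 1] = 5.0
print(a[1, 1, 0])
```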

Python collection of different sized arrays (Jagged arrays), Dask?

I have multiple 1-D numpy arrays of different size representing audio data.
Since they're different sizes (e.g. (8200,), (13246,), (61581,)), I cannot stack them into one array with NumPy. The size difference is too big for zero-padding.
I can keep them in a list or dictionary and then use for loops to iterate over them to do calculations, but I would prefer a NumPy-style approach: calling a NumPy function on the variable without having to write a for loop. Something like:
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
np_mix = irregular_stack(np0, np1)
np.sum(np_mix)
# output: [-0.7, 0.09999999999999998]
Looking at this Dask picture, I was wondering if I can do what I want with Dask.
My attempt so far is this:
import numpy as np
import dask.array as da
np0 = np.array([.2, -.4, -.5])
arr0 = da.from_array(np0, chunks=(3,))
np1 = np.array([-.8, .9])
arr1 = da.from_array(np1, chunks=(2,))
# stack them
data = [[arr0],
[arr1]]
x = da.block(data)
x.compute()
# output: ValueError: ('Shapes do not align: %s', [(1, 3), (1, 2)])
Questions
Am I misunderstanding how Dask can be used?
If it's possible, how do I do my np.sum() example?
If it's possible, is it actually faster than a for-loop on a high-end single PC?
I found the library awkward-array (https://github.com/scikit-hep/awkward-array), which allows for different length arrays and can do what I asked for:
import numpy as np
import awkward
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
varlen = awkward.fromiter([np0, np1])
# <JaggedArray [[0.2 -0.4 -0.5] [-0.8 0.9]] at 0x7f01a743e790>
varlen.sum()
# output: array([-0.7, 0.1])
The library describes itself as: "Manipulate arrays of complex data structures as easily as Numpy."
So far, it seems to satisfy everything I need.
Unfortunately, Dask arrays follow Numpy semantics, and assume that all rows are of equal length.
I don't know of a good library in Python that efficiently handles ragged arrays today, so you may be out of luck.
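For the per-array sum in particular, there is also a NumPy-only workaround that none of the answers above mention: keep all samples in one flat array together with the start offset of each segment, and reduce segment-wise with np.add.reduceat. This is only a sketch and assumes that no segment is empty.

```python
import numpy as np

np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
arrays = [np0, np1]

# One flat buffer plus the index at which each original array starts.
flat = np.concatenate(arrays)
lengths = np.array([len(a) for a in arrays])
offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))

# Sum each segment flat[offsets[i]:offsets[i+1]] without a Python loop.
sums = np.add.reduceat(flat, offsets)
print(sums)
```

The same layout works with other ufuncs that provide reduceat, such as np.maximum or np.minimum.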

Numpy: invalid literal for int() with base 10

How do I convert a Python array into a NumPy array, retaining the mixed datatypes, but replacing the tuples (parentheses) with square brackets instead? You will notice that the first 3 columns start off as int, float, float and the last column is a string. But in Block 3, all of them become strings!
Below is my output:
[(29606, 30.120779 , -97.309574 , 'DPCS')
(29606, 30.2312951 , -97.6918021 , 'DPCS')
(29606, 30.1682102 , -97.6160325 , 'DPCS')
(40880, 40.56634232, -83.10456486, 'RN')
(40880, 40.58765221, -83.14444627, 'RN')
(40880, 40.58286847, -83.12839945, 'RN')]
Block 2
[[29606, 30.120779, -97.309574, 'DPCS'], [29606, 30.2312951, -97.6918021, 'DPCS'], [29606, 30.1682102, -97.6160325, 'DPCS'], [40880, 40.5663423172498, -83.1045648601189, 'RN'], [40880, 40.5876522144065, -83.1444462730164, 'RN'], [40880, 40.5828684683826, -83.1283994529175, 'RN']]
Block 3
[['29606' '30.120779' '-97.309574' 'DPCS']
['29606' '30.2312951' '-97.6918021' 'DPCS']
['29606' '30.1682102' '-97.6160325' 'DPCS']
['40880' '40.5663423172498' '-83.1045648601189' 'RN']
['40880' '40.5876522144065' '-83.1444462730164' 'RN']
['40880' '40.5828684683826' '-83.1283994529175' 'RN']]
Process finished with exit code 0
The above comes from code:
import numpy
import pandas
from geopy.distance import great_circle
import utility_functions as uf
import timeit
from scipy.spatial.distance import cdist, euclidean
import itertools
from itertools import groupby
import numpy_indexed as npi
# normalization thresholds
DISTANCE_LOWER_THRESH = 0
DISTANCE_UPPER_THRESH = 50
#class for scoring and updating the matrix of scores between workers (rows) and patients (columns).
class WorkerPatientScores:
    def __init__(self, dist_weight=1):
        self.a = []
        self.a = ([(29606, 30.120779, -97.309574, 'DPCS'),
                   (29606, 30.2312951, -97.6918021, 'DPCS'),
                   (29606, 30.1682102, -97.6160325, 'DPCS'),
                   (40880, 40.5663423172498, -83.1045648601189, 'RN'),
                   (40880, 40.5876522144065, -83.1444462730164, 'RN'),
                   (40880, 40.5828684683826, -83.1283994529175, 'RN')])
        dt = numpy.dtype('int, float, float, object') # datatypes
        ndarray = numpy.array(self.a, dtype=dt)
        print(ndarray)
        ndarray2 = [[i[0], i[1], i[2], i[3]] for i in ndarray]
        print("Block 2")
        print(ndarray2)
        # Below removes previous datatypes
        ndarray3 = numpy.array(ndarray2)
        print("Block 3")
        print(ndarray3)
When I instead change the ndarray3 line above to:
ndarray3 = numpy.array(ndarray2, dtype=dt)
I get the error:
ValueError: invalid literal for int() with base 10: 'DPCS'
ndarray is a valid structured array with 4 fields.
ndarray2 (misnamed) is a list of lists. You iterate on the elements (rows) of ndarray, and for each extract the field elements.
ndarray3 is created without a dtype, so NumPy falls back to the one common type that can hold every element: string.
Note that self.a is a list of tuples. That's critical when creating a structured array.
alist = [(i[0], i[1], i[2], i[3]) for i in ndarray]
np.array(alist, dtype=dt)
should work. alist is a list of tuples.
ndarray.tolist() also produces that list of tuples.
np.array(..., object) works with either a list of lists or list of tuples.
Object dtype arrays have their place, but they aren't processed in the same way as structured arrays or numeric arrays.
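The difference between rows given as tuples and rows given as lists can be reproduced in a few lines. This is a minimal sketch using two of the rows from the question; the variable names are illustrative only.

```python
import numpy as np

dt = np.dtype('int, float, float, object')
rows = [(29606, 30.120779, -97.309574, 'DPCS'),
        (40880, 40.5663423172498, -83.1045648601189, 'RN')]

structured = np.array(rows, dtype=dt)       # list of tuples: one record per tuple

# Turning each record into a list loses the "this is one record" marker.
as_lists = [list(t) for t in structured.tolist()]
try:
    np.array(as_lists, dtype=dt)
    lists_failed = False
except ValueError as exc:
    lists_failed = True                     # NumPy tried to build a record from 'DPCS'
    print("list of lists:", exc)

# Converting the rows back to tuples makes the structured dtype work again.
as_tuples = [tuple(row) for row in as_lists]
roundtrip = np.array(as_tuples, dtype=dt)
print(roundtrip)

# With dtype=object the nesting does not matter; the result is a 2-D array
# that keeps the original Python values.
obj = np.array(as_lists, dtype=object)
print(obj.shape, obj.dtype)
```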
I figured this out!
ndarray3 = numpy.array(ndarray2, dtype=object)

Implementing Permutation of Complex Numbers In TensorFlow

In this associative lstm paper, http://arxiv.org/abs/1602.03032, they ask to permute a complex tensor.
They have provided their code here: https://github.com/mohammadpz/Associative_LSTM/blob/master/bricks.py#L79
I'm trying to replicate this in tensorflow. Here is what I have done:
# shape: C x F/2
# output = self.permutations: [num_copies x cell_size]
permutations = []
indices = numpy.arange(self._dim // 2)  # [0, 1, 2, ..., dim/2 - 1]; integer division keeps the indices integral
for i in range(self._num_copies):
    numpy.random.shuffle(indices)  # e.g. [4, 48, 32, ...]
    permutations.append(numpy.concatenate(
        [indices,
         [ind + self._dim // 2 for ind in indices]]))
    # each appended row holds a permutation for the real half, followed by
    # the same permutation shifted by dim/2 for the imaginary half
# C x F (numpy)
self.permutations = tf.constant(numpy.vstack(permutations), dtype = tf.int32) # This is a permutation tensor that has the stored permutations
# output = self.permutations: [num_copies x cell_size]
def permute(complex_tensor):  # complex tensor is [batch_size x cell_size]
    gather_tensor = tf.gather_nd(complex_tensor, self.permutations)
    return gather_tensor
Basically, my question is: how efficiently can this be done in TensorFlow? Is there any way to keep the batch-size dimension of the complex tensor fixed?
Also, is gather_nd the best way to go about this? Or is it better to do a for loop and iterate over each row in self.permutations using tf.gather?
def permute(self, complex_tensor):
    inputs_permuted = []
    for i in range(self.permutations.get_shape()[0].value):
        inputs_permuted.append(
            tf.gather(complex_tensor, self.permutations[i]))
    return tf.concat(0, inputs_permuted)
I thought that gather_nd would be far more efficient.
Never mind, I figured it out: the trick is to transpose the original input tensor using tf.transpose. This then allows you to do a tf.gather on the entire matrix, after which you can tf.concat the matrices together. Sorry if this wasted anyone's time.
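One reading of that fix, sketched in NumPy rather than TensorFlow so that it runs without a TensorFlow install: transposing puts the feature axis first, so a single gather along axis 0 applies every stored permutation at once. The sizes, the seed, and the helper name permute are assumptions for illustration; in TensorFlow the indexing step would correspond to tf.gather on the transposed tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, dim, num_copies = 2, 6, 3
half = dim // 2

# Build the permutation matrix as in the question: one row per copy, with
# the same shuffle applied to the real half and the imaginary half.
rows = []
for _ in range(num_copies):
    idx = rng.permutation(half)
    rows.append(np.concatenate([idx, idx + half]))
permutations = np.vstack(rows)                      # [num_copies, dim]

x = np.arange(batch_size * dim, dtype=float).reshape(batch_size, dim)

def permute(x, permutations):
    xt = x.T                                        # [dim, batch_size]
    gathered = xt[permutations]                     # [num_copies, dim, batch_size]
    # Undo the transpose per copy, then stack the copies along the batch axis.
    return gathered.transpose(0, 2, 1).reshape(-1, x.shape[1])

out = permute(x, permutations)
print(out.shape)
```

The batch dimension of each copy is left untouched; only the feature axis is reordered, and the copies end up concatenated along axis 0 as in the loop version.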
