Awkward Array: Fancy indexing with boolean mask along named axis

I have a dataset of 2D audio data. These audio fragments differ in length, hence I'm using Awkward Array. Through a Boolean mask, I want to only return the parts containing speech.
Table mask attempt
import numpy as np
import awkward as aw
awk = aw.fromiter([{"ch0": np.array([0, 1, 2]), "ch1": np.array([3, 4, 5])},
{"ch0": np.array([6, .7]), "ch1": np.array([8, 9])}])
# [{'ch0': [0.0, 1.0, 2.0], 'ch1': [3, 4, 5]},
# {'ch0': [6.0, 0.7], 'ch1': [8, 9]}]
awk_mask = aw.fromiter([{"op": np.array([False, True, False]), "cl": np.array([True, True, False])},
{"op": np.array([True, True]), "cl": np.array([True, False])}])
# [{'cl': [True, True, False], 'op': [False, True, False]},
# {'cl': [True, False], 'op': [True, True]}]
awk[awk_mask]
# TypeError: cannot interpret dtype [('cl', 'O'), ('op', 'O')] as a fancy index or mask
It seems that a Table cannot be used for fancy indexing.
Array mask attempts
Numpy equivalent
nparr = np.arange(0,6).reshape((2, -1))
# array([[0, 1, 2],
# [3, 4, 5]])
npmask = np.array([True, False, True])
nparr[:, npmask]
# array([[0, 2],
# [3, 5]])
Table version attempt; failed
awk[:, npmask]
# NotImplementedError: multidimensional index through a Table (TODO: needed for [0, n) -> [0, m) -> "table" -> ...)
Seems multidimensional selection is not implemented yet.
JaggedArray - Numpy mask version; works
jarr = aw.fromiter(nparr)
# <JaggedArray [[0 1 2] [3 4 5]] at 0x..>
jarr[:, npmask]
# array([[0, 2],
# [3, 5]])
JaggedArray - JaggedArray mask version; works
jmask = aw.fromiter(npmask)
# array([ True, False, True])
jarr[:, jmask]
# array([[0, 2],
# [3, 5]])
Questions
How to do efficient boolean mask selection with Table or with named dimensions (like xarray)?
Will multidimensional selection in Table be implemented in awkward-array, or only in awkward-1.0?
Library versions
print("numpy version : ", np.__version__) # numpy version : 1.17.3
print("pandas version : ", pd.__version__) # pandas version : 0.25.3
print("awkward version : ", aw.__version__) # awkward version : 0.12.14

This is not with named array dimensions, but with only JaggedArrays, masked selection is possible:
jarr_2d = aw.fromiter([[np.array([0, 1, 2]), np.array([3, 4, 5])],
[np.array([6, 7]), np.array([8, 9])]])
# <JaggedArray [[[0 1 2] [3 4 5]] [[6 7] [8 9]]] at 0x7fc9c7c4e750>
jarr_2d_mask = aw.fromiter([[np.array([False, True, False]), np.array([True, True, False])],
[np.array([True, True]), np.array([True, False])]])
# <JaggedArray [[[False True False] [True True False]] [[True True] [True False]]] at 0x7fc9c7c1e590>
jarr_2d[jarr_2d_mask]
# <JaggedArray [[[1] [3 4]] [[6 7] [8]]] at 0x7fc9c7c5b690>
I am not sure whether this code is efficient, especially compared to fancy indexing with plain NumPy arrays.
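On the efficiency question, one way to reason about it is to picture a jagged array as one flat buffer plus per-sublist counts. The following is only a sketch of that idea in plain NumPy, not the Awkward Array API or its actual implementation; the names `content`, `counts`, `offsets` and `running` are made up for illustration. Under that assumption the mask selection is a single vectorized operation on the flat buffer, and only the counts need recomputing.

```python
import numpy as np

# Flat content of all sublists, plus how many items each sublist holds.
# This mirrors the data in jarr_2d above: [0 1 2] [3 4 5] [6 7] [8 9].
content = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
counts = np.array([3, 3, 2, 2])

# Flat boolean mask with the same layout as content.
mask = np.array([False, True, False, True, True, False,
                 True, True, True, False])

# Start/stop offsets of each sublist inside the flat content.
offsets = np.concatenate(([0], np.cumsum(counts)))

# One vectorized boolean selection over the flat buffer.
new_content = content[mask]

# Number of surviving items per sublist, from a running sum of the mask.
running = np.concatenate(([0], np.cumsum(mask)))
new_counts = running[offsets[1:]] - running[offsets[:-1]]

# Rebuild the nested view only for display.
selected = np.split(new_content, np.cumsum(new_counts)[:-1])
print([s.tolist() for s in selected])  # [[1], [3, 4], [6, 7], [8]]
```

The result has the same nesting as the masked JaggedArray output shown above, with the outermost level flattened.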

Related

Confirm all columns in a pandas dataframe are 1-D

It is not good practice to include multi-dimensional arrays or lists as columns in a pandas dataframe. I want to raise a ValueError whenever any column in a dataframe is not 1-D.
Given a dataset
dfA = pd.DataFrame(
    np.array(
        [
            [1, (0, 2), 0, 3],
            [1, (0, 0), 1, 2],
            [0, (5, 1), 6, 1],
            [4, (3, 0), 3, 4],
            [1, (1, 1), 0, 2],
            [2, (0, 1), 3, 5],
            [1, (3, 3), 1, 2],
            [6, (4, 3), 5, 3],
            [3, (0, 2), 1, 2],
            [2, (0, 0), 2, 1],
        ],
        dtype=object,  # keeps each tuple as a single cell
    ),
    columns=['A', 'B', 'C', 'D'])
I want to do something similar to
if columns in dfA are not all 1-D:
    raise ValueError("Dataframe must only have 1-D columns")
In your case you can take the first row and check each cell with np.shape; a scalar has the empty shape (), so anything else marks a nested column:
dfA.iloc[0].map(lambda x :np.shape(x))!=()
Out[413]:
A False
B True
C False
D False
Name: 0, dtype: bool
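If the first row is not representative, the same idea can be applied to every cell and wrapped in the check the question asks for. This is a minimal sketch, assuming that "not 1-D" means "some cell holds a tuple, list or array"; the helper name `assert_columns_1d` and the frames `df_ok` and `df_bad` are hypothetical.

```python
import numpy as np
import pandas as pd

def assert_columns_1d(df):
    """Raise ValueError if any cell holds a sequence instead of a scalar."""
    # np.ndim is 0 for scalars and >= 1 for tuples, lists and arrays.
    nested = df.apply(lambda col: col.map(np.ndim).gt(0).any())
    bad = nested[nested].index.tolist()
    if bad:
        raise ValueError(f"Dataframe must only have 1-D columns; nested: {bad}")

df_ok = pd.DataFrame({"A": [1, 1, 0], "C": [0, 1, 6]})
df_bad = pd.DataFrame({"A": [1, 1, 0], "B": [(0, 2), (0, 0), (5, 1)]})

assert_columns_1d(df_ok)  # passes silently
try:
    assert_columns_1d(df_bad)
except ValueError as err:
    print(err)
```

Checking every cell is slower than checking one row, so the first-row test may still be preferable on large frames where the columns are known to be homogeneous.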

NumPy: indexing array by list of tuples - how to do it correctly?

I am in the following situation - I have the following:
Multidimensional numpy array a of n dimensions
t, an array of k rows (tuples), each with n elements. In other words, each row in this array is an index in a
What I want: from a, return an array b with k scalar elements, the ith element in b being the result of indexing a with the ith tuple from t.
Seems trivial enough. The following approach, however, does not work
def get(a, t):
    # wrong result + takes way too long
    return a[t]
I have to resort to doing this iteratively i.e. the following works correctly:
def get(a, t):
    res = []
    for ind in t:
        a_scalar = a
        for i in ind:
            a_scalar = a_scalar[i]
        # a_scalar is now a scalar
        res.append(a_scalar)
    return res
This works, except for the fact that given that each dimension in a has over 30 elements, the procedure does get really slow when n gets to more than 5. I understand that it would be slow regardless, however, I would like to exploit numpy's capabilities as I believe it would speed up this process considerably.
The key to getting this right is to understand the roles of indexing lists and tuples. Often the two are treated the same, but in numpy indexing, tuples, lists and arrays convey different information.
In [1]: a = np.arange(12).reshape(3,4)
In [2]: t = np.array([(0,0),(1,1),(2,2)])
In [4]: a
Out[4]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [5]: t
Out[5]:
array([[0, 0],
[1, 1],
[2, 2]])
You tried:
In [6]: a[t]
Out[6]:
array([[[ 0, 1, 2, 3],
[ 0, 1, 2, 3]],
[[ 4, 5, 6, 7],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[ 8, 9, 10, 11]]])
So what's wrong with it? It ran, but selected a (3,2) array of rows of a. That is, it applied t to just the first dimension, effectively a[t, :]. You want to index on all dimensions, some sort of a[t1, t2]. That's the same as a[(t1,t2)] - a tuple of indices.
In [10]: a[tuple(t[0])] # a[(0,0)]
Out[10]: 0
In [11]: a[tuple(t[1])] # a[(1,1)]
Out[11]: 5
In [12]: a[tuple(t[2])]
Out[12]: 10
or doing all at once:
In [13]: a[(t[:,0], t[:,1])]
Out[13]: array([ 0, 5, 10])
Another way to write it, is n lists (or arrays), one for each dimension:
In [14]: a[[0,1,2],[0,1,2]]
Out[14]: array([ 0, 5, 10])
In [18]: tuple(t.T)
Out[18]: (array([0, 1, 2]), array([0, 1, 2]))
In [19]: a[tuple(t.T)]
Out[19]: array([ 0, 5, 10])
More generally, in a[idx1, idx2] array idx1 is broadcast against idx2 to produce a full selection array. Here the 2 arrays are 1d and match, the selection is your t set of pairs. But the same principle applies to selecting a set of rows and columns, a[ [[0],[2]], [0,2,3] ].
Using the ideas in [10] and following, your get could be sped up with:
In [20]: def get(a, t):
    ...:     res = []
    ...:     for ind in t:
    ...:         res.append(a[tuple(ind)])  # index all dimensions at once
    ...:     return res
    ...:
In [21]: get(a,t)
Out[21]: [0, 5, 10]
If t really was a list of tuples (as opposed to an array built from them), your get could be:
In [23]: tl = [(0,0),(1,1),(2,2)]
In [24]: [a[ind] for ind in tl]
Out[24]: [0, 5, 10]
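Putting the pieces together, the loop can be dropped entirely for any number of dimensions. The following is a sketch assuming t can be converted to an integer array of shape (k, n); the name `get_vectorized` and the random test data are made up for illustration.

```python
import numpy as np

def get_vectorized(a, t):
    """Return a[t[0]], a[t[1]], ... for an index table t of shape (k, a.ndim)."""
    t = np.asarray(t)
    # One index array per dimension, packed in a tuple.
    return a[tuple(t.T)]

rng = np.random.default_rng(0)
a = rng.random((4, 5, 6, 7))
# Ten random valid index tuples, one per row; shape (10, 4).
t = np.stack([rng.integers(0, n, size=10) for n in a.shape], axis=1)

# The per-tuple loop gives the same values, just more slowly.
looped = np.array([a[tuple(ind)] for ind in t])
assert np.array_equal(get_vectorized(a, t), looped)
```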
Explore using np.ravel_multi_index
Create some test data
arr = np.arange(10**4)
arr.shape=10,10,10,10
t = []
for j in range(5):
    t.append(tuple(np.random.randint(10, size=4)))
print(t)
# [(1, 8, 2, 0),
# (2, 3, 3, 6),
# (1, 4, 8, 5),
# (2, 2, 6, 3),
# (0, 5, 0, 2)]
ta = np.array(t).T
print(ta)
# array([[1, 2, 1, 2, 0],
# [8, 3, 4, 2, 5],
# [2, 3, 8, 6, 0],
# [0, 6, 5, 3, 2]])
arr.ravel()[np.ravel_multi_index(tuple(ta), (10,10,10,10))]
# array([1820, 2336, 1485, 2263,  502])
np.ravel_multi_index basically calculates, from the tuple of input arrays, the index into a flattened array that starts with shape (in this case) (10, 10, 10, 10).
Does this do what you need? Is it fast enough?
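To make the flat-index calculation concrete, here is a small sketch that reproduces np.ravel_multi_index by hand for a C-ordered array. The variable names `strides`, `manual` and `library` are only illustrative, and the index tuples are taken from the sample output above.

```python
import numpy as np

shape = (10, 10, 10, 10)
# One row per index tuple.
idx = np.array([(1, 8, 2, 0), (2, 3, 3, 6), (0, 5, 0, 2)])

# Element strides of a C-ordered array: how far one step along each axis
# moves in the flattened array. For this shape they are (1000, 100, 10, 1).
strides = np.cumprod((1,) + shape[:0:-1])[::-1]

manual = idx @ strides
library = np.ravel_multi_index(tuple(idx.T), shape)

print(manual.tolist())  # [1820, 2336, 502]
assert np.array_equal(manual, library)
```

Because arr was built with np.arange, each flat index is also the stored value, which is why the output above reads like the index tuples with the digits run together.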

How to create a mask for nd values in a 2d NumPy array?

If I want to create a mask depending on a 1d value in a 2d array:
a = np.array([[3, 5], [7, 1]])
threshold = 2
mask = a > threshold
print(a)
print(mask)
I get:
[[3 5]
 [7 1]]
[[ True  True]
 [ True False]]
How can I create such a mask for a 2d array with nd values? Like the following example of 2d values and a 2d threshold in a 2d array:
b = np.array([[[1, 5], [3, 5]], [[4, 4], [7, 2]]])
threshold = 2, 4
print(b)
Looks like this:
[[[1 5]
[3 5]]
[[4 4]
[7 2]]]
[1, 5], [3, 5], [4, 4] and [7, 2] are the exemplary 2d values. The threshold, as set in threshold, for the first value is 2 and for the second value it's 4:
cell for [1, 5] should be False since 1 > 2 == False and 5 > 4 == True
cell for [3, 5] should be True since 3 > 2 == True and 5 > 4 == True
cell for [4, 4] should be False since 4 > 2 == True and 4 > 4 == False
cell for [7, 2] should be False since 7 > 2 == True and 2 > 4 == False
What do I have to do to get this corresponding mask?
[[ False True]
[ False False]]
NumPy's broadcast comparison handles this quite nicely for you. Just make your threshold a 1D array and require all comparisons to hold along the final axis.
t = np.array([2, 4])
(b > t).all(-1)
array([[False, True],
[False, False]])
To clarify however, your array is actually 3D. If your array was 2D, like below, this would work a bit differently:
arr = np.array([[1, 5],
[3, 5],
[4, 4],
[7, 2]])
(arr > t).all(-1)
array([False, True, False, False])
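It may help to look at the intermediate shapes. This is only an illustrative sketch of the same comparison split into two steps; the names `elementwise` and `mask` are not from the answer above.

```python
import numpy as np

b = np.array([[[1, 5], [3, 5]], [[4, 4], [7, 2]]])  # shape (2, 2, 2)
t = np.array([2, 4])                                 # shape (2,)

# t is broadcast over the last axis, so each pair is compared with (2, 4).
elementwise = b > t                # shape (2, 2, 2)

# A cell passes only if every component of its pair passes.
mask = elementwise.all(axis=-1)    # shape (2, 2)

print(mask.tolist())     # [[False, True], [False, False]]
print(b[mask].tolist())  # [[3, 5]]
```

Indexing b with the 2D mask returns the pairs that passed, one per row.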

Numpy 2-D array boolean masking

I don't understand one example in this numpy tutorial.
a = np.arange(12).reshape(3,4)
b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])
Then why will a[b1,b2] return array([4, 10])? Shouldn't it return array([[4, 6], [8, 10]])?
Any detailed explanation is appreciated!
When you index an array with multiple arrays, it indexes with pairs of elements from the indexing arrays
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> b1
array([False, True, True], dtype=bool)
>>> b2
array([ True, False, True, False], dtype=bool)
>>> a[b1, b2]
array([ 4, 10])
Notice that this is equivalent to:
>>> a[(1, 2), (0, 2)]
array([ 4, 10])
which are the elements at a[1, 0] and a[2, 2]
>>> a[1, 0]
4
>>> a[2, 2]
10
Because of this pairwise behavior, you cannot in general index with masks that select different numbers of elements (the resulting index arrays have to be able to broadcast). So this example works somewhat by accident, since both indexing arrays have exactly two True values; if one had three True values, for example, you'd get an error:
>>> b3 = np.array([True, True, True, False])
>>> a[b1, b3]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (2,) (3,)
So this is specifically letting you know that the indexing arrays must be able to be broadcast together (so that it can chip off indices together in a smart way; e.g. if one indexing array just had a single value, that would be repeated with each value from the other indexing array).
To get the results you expect, you could index the result separately:
>>> a[b1][:, b2]
array([[ 4, 6],
[ 8, 10]])
Otherwise, you could also turn your index array into a 2D array with the same shape as a, but note that if you do that the result will be a linear array (since any number of elements could be pulled out, which of course might not be square):
>>> a[np.outer(b1, b2)]
array([ 4, 6, 8, 10])
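Another option that keeps the 2D shape in a single indexing step is np.ix_, which accepts boolean masks as well as integer sequences. A short sketch with the same arrays (the name `block` is made up here):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])

# np.ix_ turns each boolean mask into integer indices and reshapes them
# so that they broadcast into an outer product of rows and columns.
block = a[np.ix_(b1, b2)]
print(block.tolist())  # [[4, 6], [8, 10]]
```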
The indices of the True values in the first array are (np.where returns a tuple of arrays, so take its first element)
>>> i = np.where(b1)[0]
>>> i
array([1, 2])
For the second array they are
>>> j = np.where(b2)[0]
>>> j
array([0, 2])
Using these index arrays together,
>>> a[i, j]
array([ 4, 10])
Another way to apply a general boolean 2D mask on a 2D numpy array is the following:
Use matrix element-wise multiplication:
import numpy as np
n = 100
mask = np.identity(n)
data = np.random.rand(n,n)
data_masked = data * mask
In this random example, you are keeping only the elements on the diagonal. The mask could be any n by n matrix though.
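One caveat with multiplication is that masked-out NaN or inf values survive it, because nan * 0 is still nan. A hedged alternative is np.where, sketched below with a small made-up matrix; the names `by_product` and `by_where` are only for illustration.

```python
import numpy as np

n = 4
mask = np.eye(n, dtype=bool)           # keep only the diagonal
data = np.arange(n * n, dtype=float).reshape(n, n)
data[0, 1] = np.nan                    # a NaN outside the mask

by_product = data * mask               # nan * 0 is still nan
by_where = np.where(mask, data, 0.0)   # masked-out cells become exactly 0.0

print(np.isnan(by_product).any())  # True
print(np.isnan(by_where).any())    # False
```

Both variants return an array of the same shape as data, unlike boolean indexing, which returns only the selected elements.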

Converting uneven rows to columns with FasterCSV

I have a CSV data file in which some rows have a lot of columns (500+) and others far fewer. I need to transpose it so that each row becomes a column in the output file. The problem is that the rows in the original file may not all have the same number of columns, so when I try the transpose method of Array I get:
`transpose': element size differs (12 should be 5) (IndexError)
Is there an alternative to transpose that works with uneven array length?
I would insert nulls to fill the holes in your matrix, something such as:
a = [[1, 2, 3], [3, 4]]
# This would throw the error you're talking about
# a.transpose
# Largest row
size = a.max { |r1, r2| r1.size <=> r2.size }.size
# Enlarge matrix inserting nils as needed
a.each { |r| r[size - 1] ||= nil }
# So now a == [[1, 2, 3], [3, 4, nil]]
aa = a.transpose
# aa == [[1, 3], [2, 4], [3, nil]]
# Initial CSV table data
csv_data = [ [1,2,3,4,5], [10,20,30,40], [100,200] ]
# Finding max length of rows
row_length = csv_data.map(&:length).max
# Inserting nil to the end of each row
csv_data.map do |row|
  (row_length - row.length).times { row.insert(-1, nil) }
end
# Let's check
csv_data
# => [[1, 2, 3, 4, 5], [10, 20, 30, 40, nil], [100, 200, nil, nil, nil]]
# Transposing...
transposed_csv_data = csv_data.transpose
# Hooray!
# => [[1, 10, 100], [2, 20, 200], [3, 30, nil], [4, 40, nil], [5, nil, nil]]
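For comparison with the NumPy threads above, the same pad-then-transpose idea can be sketched in Python, where itertools.zip_longest does the padding while transposing. This is a side illustration only, not a FasterCSV or Ruby solution, and the variable names are made up.

```python
from itertools import zip_longest

rows = [[1, 2, 3, 4, 5], [10, 20, 30, 40], [100, 200]]

# zip_longest pads the shorter rows with fillvalue while transposing.
transposed = [list(col) for col in zip_longest(*rows, fillvalue=None)]
print(transposed)
# [[1, 10, 100], [2, 20, 200], [3, 30, None], [4, 40, None], [5, None, None]]
```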

Resources