Python collection of different sized arrays (Jagged arrays), Dask? - arrays

I have multiple 1-D numpy arrays of different size representing audio data.
Since they're different sizes (e.g. (8200,), (13246,), (61581,)), I cannot stack them into one array with numpy. The size difference is too big to make zero-padding practical.
I can keep them in a list or dictionary and then use for loops to iterate over them to do calculations, but I would prefer to approach it numpy-style: calling a numpy function on the variable, without having to write a for-loop. Something like:
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
np_mix = irregular_stack(np0, np1)
np.sum(np_mix)
# output: [-0.7, 0.09999999999999998]
Looking at this Dask picture, I was wondering if I can do what I want with Dask.
My attempt so far is this:
import numpy as np
import dask.array as da
np0 = np.array([.2, -.4, -.5])
arr0 = da.from_array(np0, chunks=(3,))
np1 = np.array([-.8, .9])
arr1 = da.from_array(np1, chunks=(2,))
# stack them
data = [[arr0],
        [arr1]]
x = da.block(data)
x.compute()
# output: ValueError: ('Shapes do not align: %s', [(1, 3), (1, 2)])
Questions
Am I misunderstanding how Dask can be used?
If it's possible, how do I do my np.sum() example?
If it's possible, is it actually faster than a for-loop on a high-end single PC?

I found the library awkward-array (https://github.com/scikit-hep/awkward-array), which allows for different length arrays and can do what I asked for:
import numpy as np
import awkward
np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
varlen = awkward.fromiter([np0, np1])
# <JaggedArray [[0.2 -0.4 -0.5] [-0.8 0.9]] at 0x7f01a743e790>
varlen.sum()
# output: array([-0.7, 0.1])
The library describes itself as: "Manipulate arrays of complex data structures as easily as Numpy."
So far, it seems to satisfy everything I need.

Unfortunately, Dask arrays follow Numpy semantics, and assume that all rows are of equal length.
I don't know of a good library in Python that efficiently handles ragged arrays today, so you may be out of luck.
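For simple per-array reductions such as the sum in the question, one workaround that stays within plain NumPy is to concatenate everything into a single flat array and reduce over segments with np.add.reduceat. This is only a sketch, not a general ragged-array solution: the names flat, lengths and offsets are made up for illustration, and it assumes none of the input arrays is empty (reduceat does not return 0 for an empty segment).

```python
import numpy as np

np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
arrays = [np0, np1]  # assumed: every array has at least one element

# Store all samples back to back in one flat array
flat = np.concatenate(arrays)

# Start index of each original array inside the flat array
lengths = np.array([len(a) for a in arrays])
offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))

# Sum each segment flat[offsets[i]:offsets[i+1]] in a single call
sums = np.add.reduceat(flat, offsets)
print(sums)  # approximately [-0.7, 0.1]
```

The same flat-plus-offsets layout also works with other ufuncs that support reduceat, for example np.maximum.reduceat for a per-array maximum.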

Related

Scala - Efficient element wise sum of two arrays

I have two arrays which I would like to reduce to one array in which at each index you have the sum of the two elements in the original arrays. For example:
val arr1: Array[Int] = Array(1, 1, 3, 3, 5)
val arr2: Array[Int] = Array(2, 1, 2, 2, 1)
val arr3: Array[Int] = sum(arr1, arr2)
// This should result in:
// arr3 = Array(3, 2, 5, 5, 6)
I've seen this post: Element-wise sum of arrays in Scala, and I currently use this approach (zip/map). However, using this for a big data application I am concerned about its performance. Using this approach one has to traverse the array(s) at least twice. Is there a better approach in terms of efficiency?
The most efficient way might well be to do it lazily.
As with anything collection-oriented, Scala 2.12 and 2.13 are going to differ. This code is for Scala 2.13; 2.12 will be similar (it might need to extend IndexedSeqLike, but I don't know for sure).
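import scala.collection.IndexedSeq
import scala.math.Numeric
import scala.math.Numeric.Implicits._ // provides the + operator for any T with a Numeric instance
case class SumIndexedSeq[+T: Numeric](seq1: IndexedSeq[T], seq2: IndexedSeq[T]) extends IndexedSeq[T] {
  // The result is only as long as the shorter of the two inputs
  override val length: Int = seq1.length.min(seq2.length)
  // Each sum is computed on access; nothing is stored
  override def apply(i: Int): T =
    if (i >= length) throw new IndexOutOfBoundsException
    else seq1(i) + seq2(i)
}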
Arrays are implicitly convertible to a subtype of collection.IndexedSeq. This will compute the sum of the corresponding elements on every access (which may be generally desirable as it's possible to use a mutable IndexedSeq).
If you need an Array, you can get one with only a single traversal via
val arr3: Array[Int] = SumIndexedSeq(arr1, arr2).toArray
but SumIndexedSeq can be used anywhere a Seq can be used without a traversal.
As a further optimization, especially if you're sure that the underlying collections/arrays won't mutate, you can add a cache so you don't add the same elements together twice. It can also be generalized, if you so care, to any binary operations on T (in which case the Numeric constraint can be removed).
As Luis noted, for a performance question: experiment and benchmark. It's worth keeping in mind that a cache implementation may well entail boxing every element to put in the cache, so you might need to be accessing the same elements many times in order for the cache to be a win (and a sufficiently large cache may have implications for the stability of a distributed system).
Well, first of all, as with all things related to performance the only answer is to benchmark.
Second, are you sure you need plain mutable, invariant, weird Arrays? Can't you use something like Vector or ArraySeq?
Third, you can just do something like this, or use a while loop, which would amount to the same thing.
val result = ArraySeq.tabulate(math.min(arr1.length, arr2.length)) { i =>
  arr1(i) + arr2(i)
}

How to get a sub-shape of an array in Python?

Not sure the title is correct, but I have an array with shape (84,84,3) and I need to get subset of this array with shape (84,84), excluding that third dimension.
How can I accomplish this with Python?
your_array[:,:,0]
This is called slicing. This particular example gets the first 'layer' of the array. This assumes your subshape is a single layer.
If you are using numpy arrays, using slices would be a standard way of doing it:
import numpy as np
n = 3 # or any other positive integer
a = np.empty((84, 84, n))
i = 0 # index of the layer to extract, 0 <= i < n
b = a[:, :, i]
print(b.shape)
I recommend you have a look at this.

Printing numpy array and dataframe list, any dependencies?

I am trying to print two different lists with numpy and pandas respectively.
The strange thing is that I can only print one list at a time by commenting out the other one with all its associated code. Do numpy and pandas have any dependencies on each other?
import numpy as np
import pandas as pd
np.array = []
for i in range(7):
    np.array.append([])
    np.array[i] = i
values = np.array
print(np.power(np.array,3))
df = pd.DataFrame({'X':[78,85,96,80,86], 'Y':[84,94,89,83,86],'Z':[86,97,96,72,83]})
print(df)
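I'm not sure what you mean by "I can only print one list at a time by commenting the other one with all its associated code", but any strange behavior you're seeing probably comes from assigning to np.array. That assignment replaces NumPy's array function with a plain list for the rest of the program, so any later code that calls np.array(...), which may include code inside pandas, is likely to fail. You should name your variable something different, e.g. arr. Perhaps you were trying to do this: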
arr = []
for i in range(7):
    arr.append([])
    arr[i] = i
values = np.array(arr)
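If the goal is just the cubes of 0 through 6, a shorter variant is possible. This is a sketch of what I assume the intent was, not necessarily what you had in mind: np.arange builds the values directly, and nothing on the np module is reassigned, so later code that relies on np.array keeps working.

```python
import numpy as np

# Build 0..6 directly instead of appending to a list in a loop
values = np.arange(7)

# Element-wise cube; cubes holds 0, 1, 8, 27, 64, 125, 216
cubes = np.power(values, 3)
print(cubes)

# np.array was never reassigned, so it is still NumPy's function
print(callable(np.array))  # True
```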

iterating through two numpy arrays applying a function in Python

I have
import numpy as np
a = np.array([np.nan,2,3])
b = np.array([1,np.nan,2])
I want to apply a function to a and b. Is there a fast way of doing this (like in Pandas, where we can use apply)?
Specifically, I am interested in averaging a and b, but taking the average to be one of the numbers when the other number is missing.
i.e. I want to return
np.array([1,2,2.5])
for the example above. However, I would like to know the answer to this in a more general setting (where I want to apply an operation element-wise to a number of numpy arrays)
You can use numpy.nanmean, which ignores NaNs:
np.nanmean([a, b], axis=0)
# array([ 1. , 2. , 2.5])
If you want to apply a custom function over NumPy arrays with the efficiency of NumPy's universal functions (ufuncs), the choices are:
Write your own C code.
Use SymPy's ufuncify function to generate the code for you.
Here is an example of the latter, where the function is exp(x) + log(y) (since NumPy's ufuncs exp and log are already available, this is just for demonstration):
import numpy as np
import sympy as sym
from sympy.utilities.autowrap import ufuncify
x, y = sym.symbols('x y')
f = ufuncify([x, y], sym.exp(x) + sym.log(y))
Now applying f(np.array([1, 2, 3]), np.array([4, 5, 6])) will return NumPy array [4.10457619, 8.99849401, 21.87729639] in a way that's not a Python loop but a call to (by default) compiled Fortran code.
(But in practice, you are likely to find that NumPy already has ufuncs that do what you want, if combined in the right way.)
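To illustrate that last point, here is a small sketch that uses only existing NumPy ufuncs, with no code generation. The variable names (both_present, avg, f_xy) are just for illustration. The first part reproduces the NaN-aware average from the question with np.where, and the second evaluates the same exp(x) + log(y) expression as the ufuncify example.

```python
import numpy as np

a = np.array([np.nan, 2.0, 3.0])
b = np.array([1.0, np.nan, 2.0])

# Where both values are present take the mean, otherwise take whichever one is not NaN
both_present = ~np.isnan(a) & ~np.isnan(b)
fallback = np.where(np.isnan(a), b, a)
avg = np.where(both_present, (a + b) / 2, fallback)
print(avg)  # the values 1, 2 and 2.5

# The demonstration function from above, built from existing ufuncs
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
f_xy = np.exp(x) + np.log(y)
print(f_xy)  # approximately [4.10457619, 8.99849401, 21.87729639]
```

Note that np.where evaluates both branches for every element, which is fine here because arithmetic on NaN just yields NaN.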

Implementing Permutation of Complex Numbers In TensorFlow

In this associative lstm paper, http://arxiv.org/abs/1602.03032, they ask to permute a complex tensor.
They have provided their code here: https://github.com/mohammadpz/Associative_LSTM/blob/master/bricks.py#L79
I'm trying to replicate this in tensorflow. Here is what I have done:
# shape: C x F/2
# output = self.permutations: [num_copies x cell_size]
permutations = []
indices = numpy.arange(self._dim // 2)  # [0, 1, 2, ..., dim/2 - 1]
for i in range(self._num_copies):
    numpy.random.shuffle(indices)  # e.g. [4, 48, 32, ...]
    permutations.append(numpy.concatenate(
        [indices,
         [ind + self._dim // 2 for ind in indices]]))
    # each appended row holds a permutation followed by the same permutation + dim/2 (for the imaginary part)
# C x F (numpy)
self.permutations = tf.constant(numpy.vstack(permutations), dtype=tf.int32)  # tensor holding the stored permutations
# output = self.permutations: [num_copies x cell_size]
def permute(complex_tensor):  # complex_tensor is [batch_size x cell_size]
    gather_tensor = tf.gather_nd(complex_tensor, self.permutations)
    return gather_tensor
Basically, my question is: how efficiently can this be done in TensorFlow? Is there any way to keep the batch-size dimension of the complex tensor fixed?
Also, is gather_nd the best way to go about this? Or is it better to write a for loop and iterate over each row in self.permutations using tf.gather?
def permute(self, complex_tensor):
    inputs_permuted = []
    for i in range(self.permutations.get_shape()[0].value):
        inputs_permuted.append(
            tf.gather(complex_tensor, self.permutations[i]))
    return tf.concat(0, inputs_permuted)
I thought that gather_nd would be far more efficient.
Never mind, I figured it out. The trick is to first transpose the original input tensor with tf.transpose. That lets you run tf.gather on the entire matrix, after which you can tf.concat the resulting matrices together. Sorry if this wasted anyone's time.
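For anyone who wants to see the shape bookkeeping, here is a rough NumPy stand-in for that idea. It is not the TensorFlow code itself: np.take plays the role of tf.gather, the sizes are made up, and it assumes each permutation is meant to reorder the cell dimension (columns) of a [batch_size x cell_size] input. It checks that a single gather on the transposed input matches the per-permutation loop.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, dim, num_copies = 4, 6, 3  # illustrative sizes only
half = dim // 2

# Each row: a permutation of the real half, then the same permutation shifted to the imaginary half
perms = np.vstack([
    np.concatenate([idx, idx + half])
    for idx in (rng.permutation(half) for _ in range(num_copies))
])  # shape (num_copies, dim)

x = rng.normal(size=(batch_size, dim))

# Loop version: permute the columns once per copy, then stack the copies along axis 0
looped = np.concatenate([x[:, p] for p in perms], axis=0)

# Transpose trick: gather rows of x.T with all permutations at once
gathered = np.take(x.T, perms, axis=0)  # shape (num_copies, dim, batch_size)
vectorized = gathered.transpose(0, 2, 1).reshape(num_copies * batch_size, dim)

print(np.array_equal(looped, vectorized))  # True
```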
