Efficiently sorting and filtering a JaggedArray by another one - awkward-array

I have a JaggedArray (awkward.array.jagged.JaggedArray) that contains indices pointing to positions in another JaggedArray. Both arrays have the same length, but each of the numpy.ndarrays that the JaggedArrays contain can be of a different length. I would like to sort the second array using the indices of the first array, at the same time dropping the elements of the second array that are not indexed from the first array. The first array can additionally contain values of -1 (these could be replaced by None if needed, but currently they are not), which mean that there is no match in the second array. In such a case, the corresponding position in the result should be set to a default value (e.g. 0.0).
Here's a practical example and how I solve this at the moment:
import uproot
import numpy as np
import awkward
def good_index(my_indices, my_values):
    my_list = []
    for index in my_indices:
        if index > -1:
            my_list.append(my_values[index])
        else:
            my_list.append(0.0)
    return my_list
indices = awkward.fromiter([[0, -1], [3,1,-1], [-1,0,-1]])
values = awkward.fromiter([[1.1, 1.2, 1.3], [2.1,2.2,2.3,2.4], [3.1]])
new_map = awkward.fromiter(map(good_index, indices, values))
The resulting new_map is: [[1.1 0.0] [2.4 2.2 0.0] [0.0 3.1 0.0]].
Is there a more efficient/faster way of achieving this? I was thinking that one could use NumPy functionality such as numpy.where, but due to the different lengths of the ndarrays this fails, at least in the ways I tried.

If all of the subarrays in values are guaranteed to be non-empty (so that indexing with -1 returns the last subelement, not an error), then you can do this:
>>> almost = values[indices] # almost what you want; uses -1 as a real index
>>> almost.content = awkward.MaskedArray(indices.content < 0, almost.content)
>>> almost.fillna(0.0)
<JaggedArray [[1.1 0.0] [2.4 2.2 0.0] [0.0 3.1 0.0]] at 0x7fe54c713c88>
The last step is optional because without it, the missing elements are None, rather than 0.0.
If some of the subarrays in values are empty, you can pad them to ensure they have at least one subelement. All of the original subelements are indexed the same way they were before, since pad only increases the length, if need be.
>>> values = awkward.fromiter([[1.1, 1.2, 1.3], [], [2.1, 2.2, 2.3, 2.4], [], [3.1]])
>>> values.pad(1)
<JaggedArray [[1.1 1.2 1.3] [None] [2.1 2.2 2.3 2.4] [None] [3.1]] at 0x7fe54c713978>
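For comparison, the same index-then-mask idea can be sketched in plain NumPy, one row at a time; the helper name apply_jagged_index is mine, and this is only a sketch, not an optimized jagged kernel:

```python
import numpy as np

def apply_jagged_index(indices, values, default=0.0):
    # For each row, pick values[row][k] for every k in indices[row];
    # a -1 index yields `default` instead of an element.
    out = []
    for idx_row, val_row in zip(indices, values):
        idx = np.asarray(idx_row)
        vals = np.asarray(val_row, dtype=float)
        if len(vals) == 0:
            # Nothing to index into: every slot falls back to the default.
            out.append(np.full(len(idx), default))
            continue
        picked = vals[np.clip(idx, 0, None)]        # clip -1 to a legal index...
        out.append(np.where(idx < 0, default, picked))  # ...then mask it out
    return out

indices = [[0, -1], [3, 1, -1], [-1, 0, -1]]
values = [[1.1, 1.2, 1.3], [2.1, 2.2, 2.3, 2.4], [3.1]]
result = apply_jagged_index(indices, values)
# result rows: [1.1, 0.0], [2.4, 2.2, 0.0], [0.0, 3.1, 0.0]
```

This also handles empty subarrays without padding, at the cost of a Python-level loop over rows.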


Tensorflow-probability transform event shape of JointDistribution

I would like to create a distribution for n categorical variables C_1, ..., C_n whose event shape is n. Using JointDistributionSequentialAutoBatched, the event shape is a list [[], ..., []]. For example, for n=2:
import tensorflow_probability.python.distributions as tfd
probs = [
    [0.8, 0.2],      # C_1 in {0,1}
    [0.3, 0.3, 0.4]  # C_2 in {0,1,2}
]
D = tfd.JointDistributionSequentialAutoBatched([tfd.Categorical(probs=p) for p in probs])
>>> D
<tfp.distributions.JointDistributionSequentialAutoBatched 'JointDistributionSequentialAutoBatched' batch_shape=[] event_shape=[[], []] dtype=[int32, int32]>
How do I reshape it to get event shape [2]?
A few different approaches could work here:
Create a batch of Categorical distributions and then use tfd.Independent to reinterpret the batch dimension as the event:
probs = [
    [0.8, 0.2, 0.0], # C_1 in {0,1}
    [0.3, 0.3, 0.4]  # C_2 in {0,1,2}
]
vector_dist = tfd.Independent(
    tfd.Categorical(probs=probs),
    reinterpreted_batch_ndims=1)
Here I added an extra zero to pad out probs so that both distributions can be represented by a single Categorical object.
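The zero-padding trick generalizes to any number of categoricals with different support sizes; here is a small NumPy-only sketch of it (the helper name pad_probs is mine, not a TFP function):

```python
import numpy as np

def pad_probs(probs):
    # Right-pad every probability row with zeros to a common width,
    # so one batched Categorical can represent all the variables at once.
    width = max(len(p) for p in probs)
    return np.array([list(p) + [0.0] * (width - len(p)) for p in probs])

probs = [[0.8, 0.2], [0.3, 0.3, 0.4]]
padded = pad_probs(probs)
# padded.tolist() == [[0.8, 0.2, 0.0], [0.3, 0.3, 0.4]]
```

Outcomes that receive probability 0 can never be sampled, so each padded row still defines the same distribution as before.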
Use the Blockwise distribution, which stuffs its component distributions into a single vector (as opposed to the JointDistribution classes, which return them as separate values):
vector_dist = tfd.Blockwise([tfd.Categorical(probs=p) for p in probs])
The closest to a direct answer to your question is to apply the Split bijector, whose inverse is Concat, to the joint distribution:
tfb = tfp.bijectors
D = tfd.JointDistributionSequentialAutoBatched(
    [tfd.Categorical(probs=[p]) for p in probs])
vector_dist = tfb.Invert(tfb.Split(2))(D)
Note that I had to awkwardly write probs=[p] instead of just probs=p. This is because the Concat bijector, like tf.concat, can't change the tensor rank of its argument---it can concatenate small vectors into a big vector, but not scalars into a vector---so we have to ensure that its inputs are themselves vectors. This could be avoided if TFP had a Stack bijector analogous to tf.stack / tf.unstack (it doesn't currently, but there's no reason this couldn't exist).

Saving to an empty array of arrays from nested for-loop

I have an array of arrays filled with zeros, so this is the shape I want for the result.
I'm having trouble saving the nested for-loop to this array of arrays. In other words, I want to replace all of the zeros with what the last line calculates.
percent = []
for i in range(len(F300)):
    percent.append(np.zeros(lengths[i]))
for i in range(0, len(Name)):
    for j in range(0, lengths[i]):
        percent[i][j] = (j + 1) / lengths[i]
The last line only saves the last j value for each i.
I'm getting:
percent = [[0,0,1],[0,1],[0,0,0,1]]
but I want:
percent = [[.3,.6,1],[.5,1],[.25,.5,.75,1]]
The problem with this code is that, because it runs under Python 2.7, the / operator performs "classic" division, which truncates to an integer when both operands are integers. There are a couple of different approaches to solving this in Python 2.7. One approach is to convert the numbers being divided into floating-point numbers:
import numpy as np
lengths = [3, 2, 4] # Deduced values of lengths from your output.
percent = []
for i in range(3): # Deduced size of F300 from the length of percent.
    percent.append(np.zeros(lengths[i]))
for i in range(0, len(percent)):
    for j in range(0, lengths[i]):
        percent[i][j] = float(j + 1) / float(lengths[i])
Another approach is to import division from the __future__ module. Note that this import must appear before any other statement in your code.
from __future__ import division
import numpy as np
lengths = [3, 2, 4] # Deduced values of lengths from your output.
percent = []
for i in range(3): # Deduced size of F300 from the length of percent.
    percent.append(np.zeros(lengths[i]))
for i in range(0, len(percent)):
    for j in range(0, lengths[i]):
        percent[i][j] = (j + 1) / lengths[i]
The third approach, and the one I personally prefer, is to make good use of NumPy's built-in functions:
import numpy as np
lengths = [3, 2, 4] # Deduced values of lengths from your output.
percent = np.array([np.linspace(1.0 / float(l), 1.0, l) for l in lengths])
All three approaches will produce a list (or in the last case, numpy.ndarray object) of numpy.ndarray objects with the following values:
[[0.33333333, 0.66666667, 1.], [0.5, 1.], [0.25, 0.5, 0.75, 1.]]
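Under true division (Python 3, or Python 2 with the __future__ import), the same fractions can be sanity-checked with a single comprehension; lengths is again deduced from the desired output:

```python
lengths = [3, 2, 4]  # deduced from the desired output
# Each row holds (j+1)/n for j = 0..n-1, i.e. evenly spaced fractions ending at 1.0.
percent = [[(j + 1) / n for j in range(n)] for n in lengths]
# percent[1] == [0.5, 1.0]; percent[2] == [0.25, 0.5, 0.75, 1.0]
```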

Python: Finding a numpy array in a list of numpy arrays

I have a list of 50 numpy arrays called vectors:
[array([0.1, 0.8, 0.03, 1.5], dtype=float32), array([1.2, 0.3, 0.1], dtype=float32), .......]
I also have a smaller list (means) of 10 numpy arrays, all of which are from the bigger list above. I want to loop through each array in means and find its position in vectors.
So when I do this:
for c in means:
    index = vectors.index(c)
I get the error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I've gone through various SO questions and I know why I'm getting this error, but I can't find a solution. Any help?
One possible solution is converting to lists.
vectors = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], np.int32)
print(vectors.tolist().index([1, 2, 3]))
This prints 0, because [1, 2, 3] is found at index 0 of vectors.
The example above uses a 2D NumPy array; however, you seem to have a list of NumPy arrays, so I would convert it to a list of lists this way:
vectors = [arr.tolist() for arr in vectors]
Do the same for means:
means = [arr.tolist() for arr in means]
Now that we are working with two lists of lists, your original for loop will work:
for c in means:
    index = vectors.index(c)
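Putting both conversions together into one runnable snippet (the example arrays are made up and smaller than your 50/10):

```python
import numpy as np

vectors = [np.array([0.1, 0.8, 0.03, 1.5], dtype=np.float32),
           np.array([1.2, 0.3, 0.1], dtype=np.float32),
           np.array([2.0, 2.5], dtype=np.float32)]
means = [np.array([1.2, 0.3, 0.1], dtype=np.float32)]

# Plain lists compare element-by-element to a single bool, so list.index works.
vectors_l = [arr.tolist() for arr in vectors]
means_l = [arr.tolist() for arr in means]

positions = [vectors_l.index(m) for m in means_l]
# positions == [1]
```

The ValueError disappears because comparing two plain lists yields one True/False, whereas comparing two NumPy arrays yields an array of booleans with no single truth value.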

Replace zero array with new values one by one NumPy

I'm stuck on a simple question in NumPy. I have an array of zero values. Each time I generate a new value, I would like to add it to the array one by one.
# something like this
for x in l:
    arr.append(x) # from python logic
So I would like to add each x into the array one by one, so that I would get: 1st iteration arr=[1, 0, 0]; 2nd iteration arr=[1, 5, 0]; 3rd arr=[1, 5, 10].
Basically I need to substitute the zeros with new values one by one in NumPy (I am learning NumPy!).
I checked many NumPy options like np.append (it appends new values after the existing ones), but I can't find the right one.
Thank you.
There are a few things to pick up with numpy:
you can generate the array full of zeros with
>>> np.zeros(3)
array([ 0., 0., 0.])
You can get/set array elements with indexing, as with lists etc.:
arr[2] = 7
for i, val in enumerate([1, 5, 10]):
    arr[i] = val
Or, if you want to fill the array with something like a list, you can directly use:
>>> np.array([1, 5, 10])
array([ 1, 5, 10])
Also, numpy's signature for appending stuff to an array is a bit different:
arr = np.append(arr, 7)
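Putting the pieces above together into one runnable snippet:

```python
import numpy as np

arr = np.zeros(3)  # array([0., 0., 0.])
for i, val in enumerate([1, 5, 10]):
    arr[i] = val   # [1, 0, 0] -> [1, 5, 0] -> [1, 5, 10]
```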
Having said that, you should consider diving into NumPy's own user guide.

Julia approach to the Python list-of-lists pattern

I just started tinkering with Julia and I'm really getting to like it. However, I am running into a road block. For example, in Python (although not very efficient or pythonic), I would create an empty list and append a list of a known size and type, and then convert to a NumPy array:
Python Snippet
a = []
for ....
b = numpy.array(a)
I want to be able to do something similar in Julia, but I can't seem to figure it out. This is what I have so far:
Julia snippet
a = Array{Float64}[]
for .....
The result is an n-element Array{Array{Float64,N},1} of size (n,), but I would like it to be an nx4 Array{Float64,2}.
Any suggestions or better way of doing this?
The literal translation of your code would be
# Building up as rows
a = [1. 2. 3. 4.]
for i in 1:3
    a = vcat(a, [1. 2. 3. 4.])
end
# Building up as columns
b = [1., 2., 3., 4.]
for i in 1:3
    b = hcat(b, [1., 2., 3., 4.])
end
But this isn't a natural pattern in Julia; you'd do something like
A = zeros(4,4)
for i in 1:4, j in 1:4
    A[i,j] = j
end
or even
A = Float64[j for i in 1:4, j in 1:4]
Basically allocating all the memory at once.
Does this do what you want?
julia> a = Array{Float64}[]
0-element Array{Array{Float64,N},1}

julia> for i = 1:3
           push!(a, [1., 2., 3., 4.])
       end

julia> a
3-element Array{Array{Float64,N},1}:
 [1.0,2.0,3.0,4.0]
 [1.0,2.0,3.0,4.0]
 [1.0,2.0,3.0,4.0]

julia> b = hcat(a...)'
3x4 Array{Float64,2}:
 1.0  2.0  3.0  4.0
 1.0  2.0  3.0  4.0
 1.0  2.0  3.0  4.0
It seems to match the python output:
In [9]: a = []
In [10]: for i in range(3):
   ....:     a.append([1, 2, 3, 4])
In [11]: b = numpy.array(a); b
array([[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3, 4]])
I should add that this is probably not what you actually want to be doing as the hcat(a...)' can be expensive if a has many elements. Is there a reason not to use a 2d array from the beginning? Perhaps more context to the question (i.e. the code you are actually trying to write) would help.
The other answers don't work if the number of loop iterations isn't known in advance, or assume that the underlying arrays being merged are one-dimensional. It seems Julia lacks a built-in function for "take this list of N-D arrays and return me a new (N+1)-D array".
Julia requires a different concatenation call depending on the dimension of the underlying data. So, for example, if the elements of a are vectors, one can use hcat(a...) or cat(a..., dims=2). But if the elements of a are e.g. 2D arrays, one must use cat(a..., dims=3), etc. The dims argument to cat is not optional, and there is no default value to indicate "the last dimension".
Here is a helper function that mimics the np.array functionality for this use case. (I called it collapse instead of array, because it doesn't behave quite the same way as np.array)
function collapse(x)
    return cat(x..., dims=length(size(x[1])) + 1)
end
One would use this as
a = []
for ...
    ... compute new_a ...
    push!(a, new_a)
end
a = collapse(a)
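For reference, the NumPy counterpart of this collapse helper is np.stack, which takes a list of N-D arrays and returns an (N+1)-D array; a quick sketch:

```python
import numpy as np

# Three 1-D arrays stacked along a new leading axis give a 2-D array,
# the same layout as hcat(a...)' in the Julia answer above.
a = [np.array([1.0, 2.0, 3.0, 4.0]) for _ in range(3)]
b = np.stack(a)  # shape (3, 4)
```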