Performing an action on individual arrays in an array of arrays

I have an array of arrays and I want to be able to perform some cleanup on each array individually within the array of arrays.
Here are the two arrays:
x = [array([4, 1, 2, 0]), array([5])]
y = [array([ 0.6, 0.7, 0.8, 0.9]), array([ 0.4])]
I want to find the top, let's say, 50% of the y-array and apply the same cut to the x-array so the indices match up.
So in this case I would want it to return:
x = [array([2, 0]), array([5])]
y = [array([0.8, 0.9]), array([0.4])]
Is this possible?
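One way to do this, as a minimal sketch (assuming "top 50%" means the largest y-values, rounded up, kept in their original order; the helper name cut_top_fraction is just illustrative):
import numpy as np

x = [np.array([4, 1, 2, 0]), np.array([5])]
y = [np.array([0.6, 0.7, 0.8, 0.9]), np.array([0.4])]

def cut_top_fraction(xs, ys, frac=0.5):
    x_out, y_out = [], []
    for xa, ya in zip(xs, ys):
        k = max(1, int(np.ceil(len(ya) * frac)))  # how many elements survive the cut
        idx = np.sort(np.argsort(ya)[-k:])        # indices of the k largest y, in original order
        x_out.append(xa[idx])
        y_out.append(ya[idx])
    return x_out, y_out

x_cut, y_cut = cut_top_fraction(x, y)
# x_cut -> [array([2, 0]), array([5])]
# y_cut -> [array([0.8, 0.9]), array([0.4])]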

array in array to array in numpy

Dear friends on Stack Overflow,
I am having trouble with a calculation involving NumPy and SymPy. A is defined by
import numpy as np
import sympy as sym
sym.var('x y')
f = sym.Matrix([0,x,y])
func = sym.lambdify( (x,y), f, "numpy")
X=np.array([1,2,3])
Y=np.array([1,2,3])
A = func(X,Y)
Here, X and Y are just examples. In general, X and Y are one-dimensional numpy arrays of the same length. Then A's output is
array([[0],
       [array([1, 2, 3])],
       [array([1, 2, 3])]], dtype=object)
But, I'd like to get this as
np.array([[0,0,0],[1,2,3],[1,2,3]]).
If we call this B, how do I convert A to B automatically? B's first row is filled with 0, and it has the same length as X and Y.
Do you have any ideas?
First let's make sure we understand what is happening:
In [52]: x, y = symbols('x y')
In [54]: f = Matrix([0,x,y])
...: func = lambdify( (x,y), f, "numpy")
In [55]: f
Out[55]:
⎡0⎤
⎢ ⎥
⎢x⎥
⎢ ⎥
⎣y⎦
In [56]: print(func.__doc__)
Created with lambdify. Signature:
func(x, y)
Expression:
Matrix([[0], [x], [y]])
Source code:
def _lambdifygenerated(x, y):
    return (array([[0], [x], [y]]))
See how the numpy function looks just like the sympy one, replacing sym.Matrix with np.array. lambdify just does a lexical translation; it does not have deep knowledge of the differences between the two languages.
With scalars the func runs as expected:
In [57]: func(1,2)
Out[57]:
array([[0],
       [1],
       [2]])
With arrays the result is this ragged array (a new enough numpy adds this warning):
In [59]: func(np.array([1,2,3]),np.array([1,2,3]))
<lambdifygenerated-2>:2: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return (array([[0], [x], [y]]))
Out[59]:
array([[0],
       [array([1, 2, 3])],
       [array([1, 2, 3])]], dtype=object)
If you don't know numpy, sympy is not a shortcut to filling in your knowledge gaps.
The simplest fix is to replace the original 0 with another symbol.
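A minimal sketch of that fix, assuming a stand-in symbol z that receives an array of zeros at call time (the squeeze() drops the length-1 column axis of the matrix output):
import numpy as np
import sympy as sym

x, y, z = sym.symbols('x y z')
f = sym.Matrix([z, x, y])                   # stand-in symbol z instead of the literal 0
func = sym.lambdify((x, y, z), f, "numpy")

X = np.array([1, 2, 3])
Y = np.array([1, 2, 3])
B = func(X, Y, np.zeros_like(X)).squeeze()  # (3, 1, 3) -> (3, 3)
# B -> array([[0, 0, 0],
#             [1, 2, 3],
#             [1, 2, 3]])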
Even in sympy, the 0 is not expanded:
In [65]: f.subs({x:Matrix([[1,2,3]]), y:Matrix([[4,5,6]])})
Out[65]:
⎡ 0 ⎤
⎢ ⎥
⎢[1 2 3]⎥
⎢ ⎥
⎣[4 5 6]⎦
In [74]: Matrix([[0,0,0],[1,2,3],[4,5,6]])
Out[74]:
⎡0 0 0⎤
⎢ ⎥
⎢1 2 3⎥
⎢ ⎥
⎣4 5 6⎦
In [75]: Matrix([[0],[1,2,3],[4,5,6]])
...
ValueError: mismatched dimensions
To make the desired array in numpy we have to do something like:
In [71]: arr = np.zeros((3,3), int)
In [72]: arr[1:,:] = [[1,2,3],[4,5,6]]
In [73]: arr
Out[73]:
array([[0, 0, 0],
[1, 2, 3],
[4, 5, 6]])
That is, initialize the array and fill the selected rows. There isn't a simple expression that will do the desired 'automatically fill the first row with 0', much less something that can be naively translated from sympy.
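That said, if you already have the object-dtype A from lambdify (Out[59] above), one after-the-fact workaround (a sketch, not a general solution) is to broadcast each row against the input shape:
import numpy as np

# A = func(np.array([1,2,3]), np.array([1,2,3])) from above, dtype=object
X = np.array([1, 2, 3])
B = np.vstack([np.broadcast_to(row[0], X.shape) for row in A])
# B -> array([[0, 0, 0],
#             [1, 2, 3],
#             [1, 2, 3]])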

Selecting numpy array elements

I have the task of selecting p% of elements within a given numpy array. For example,
# Initialize 5 x 3 array-
x = np.random.randint(low = -10, high = 10, size = (5, 3))
x
'''
array([[-4, -8,  3],
       [-9, -1,  5],
       [ 9,  1,  1],
       [-1, -1, -5],
       [-1, -4, -1]])
'''
Now, I want to select, say, p = 30% of the numbers in x; 30% of the 15 numbers in x is 5 (4.5 rounded up).
Is there a way to select these 30% of the numbers in x? Here p can change, and the numpy array x may be 3-D or of even higher dimensionality.
I am using Python 3.7 and numpy 1.18.1
Thanks
You can use np.random.choice to sample without replacement from a 1d numpy array:
p = 0.3
np.random.choice(x.flatten(), int(np.ceil(x.size * p)), replace=False)
For large arrays, the performance of sampling without replacement can be pretty bad, but there are some workarounds.
You can build a random 0/1 mask with np.random.choice and then use np.nonzero and indexing (note that each element is kept independently with probability p, so the selected count is only 30% in expectation):
np.random.seed(1)
x[np.nonzero(np.random.choice([1, 0], size=x.shape, p=[0.3,0.7]))]
Output:
array([ 3, -1, 5, 9, -1, -1])
I found a way of selecting p% of numpy elements:
p = 20
x_abs = np.abs(x)  # work with the element magnitudes
# To select (roughly) the smallest p% of elements-
x_abs[x_abs < np.percentile(x_abs, p)]
# To select those elements and set them to a value (in this case, zero)-
x_abs[x_abs < np.percentile(x_abs, p)] = 0
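For the general case, here is a minimal sketch that samples exactly ceil(p% of the elements) without replacement from an array of any dimensionality (the helper name sample_fraction is just illustrative):
import numpy as np

def sample_fraction(a, p, rng=None):
    # Sample ceil(p * a.size) distinct elements from `a`, whatever its shape
    rng = np.random.default_rng() if rng is None else rng
    k = int(np.ceil(a.size * p))
    flat_idx = rng.choice(a.size, size=k, replace=False)  # distinct flat positions
    return a.ravel()[flat_idx]

x = np.random.randint(low=-10, high=10, size=(5, 3))
sample_fraction(x, 0.3)  # 5 of the 15 elements, sampled without replacement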

How does numpy determine the dimensions of a column vector?

I'm starting out with numpy and was trying to figure out how its arrays work for column vectors. Defining the following:
x1 = np.array([3.0, 2.0, 1.0])
x2 = np.array([-2.0, 1.0, 0.0])
And calling
print("inner product x1/x2: ", np.inner(x1, x2))
Produces inner product x1/x2: -4.0 as expected - this made me think that numpy assumes an array of this form is a column vector and, as part of the inner function, transposes one of them to give a scalar. However, I wrote some code to test this idea and it gave some results that I don't understand.
After doing some googling about how to specify that an array is a column vector using .T I defined the following:
x = np.array([1, 0]).T
xT = np.array([1, 0])
Where I intended for x to be a column vector and xT to be a row vector. However, calling the following:
print(x)
print(x.shape)
print(xT)
print(xT.shape)
Produces this:
[1 0]
(2,)
[1 0]
(2,)
Which suggests the two arrays have the same dimensions, despite one supposedly being the transpose of the other. Furthermore, calling both np.inner(x,x) and np.inner(x,xT) produces the same result. Am I misunderstanding the .T attribute, or perhaps some fundamental feature of numpy/linear algebra? I don't feel like x and xT should be the same vector.
Finally, the reason I initially used .T was because trying to define a column vector as x = np.array([[1], [0]]) and calling print(np.inner(x, x)) produced the following as the inner product:
[[1 0]
 [0 0]]
Which is the output you'd expect to see for the outer product. Am I misusing this way of defining a column vector?
Look at the inner docs:
Ordinary inner product of vectors for 1-D arrays
...
np.inner(a, b) = sum(a[:]*b[:])
With your sample arrays:
In [374]: x1 = np.array([3.0, 2.0, 1.0])
...: x2 = np.array([-2.0, 1.0, 0.0])
In [375]: x1*x2
Out[375]: array([-6., 2., 0.])
In [376]: np.sum(x1*x2)
Out[376]: -4.0
In [377]: np.inner(x1,x2)
Out[377]: -4.0
In [378]: np.dot(x1,x2)
Out[378]: -4.0
In [379]: x1@x2
Out[379]: -4.0
From the wiki for dot/scalar/inner product:
https://en.wikipedia.org/wiki/Dot_product
two equal-length sequences of numbers (usually coordinate vectors) and returns a single number
If vectors are identified with row matrices, the dot product can also
be written as a matrix product
Coming from a linear algebra world, it is easy to think of everything in terms of matrices (2d) and vectors, which are 1-row or 1-column matrices. MATLAB/Octave works in that framework. But numpy is more general, with arrays of 0 or more dimensions, not just 2.
np.transpose does not add dimensions; it just permutes the existing ones, so for a 1-d array x1.T does not change anything.
A column vector can be made with np.array([[1], [0]]) or:
In [381]: x1
Out[381]: array([3., 2., 1.])
In [382]: x1[:,None]
Out[382]:
array([[3.],
       [2.],
       [1.]])
In [383]: x1.reshape(3,1)
Out[383]:
array([[3.],
       [2.],
       [1.]])
The np.inner docs describe what happens when the inputs are not 1d, such as your 2d (2,1) shape x. It says it uses np.tensordot, which is a generalization of np.dot, the matrix product.
In [386]: x = np.array([[1],[0]])
In [387]: x
Out[387]:
array([[1],
       [0]])
In [388]: np.inner(x,x)
Out[388]:
array([[1, 0],
       [0, 0]])
In [389]: np.dot(x,x.T)
Out[389]:
array([[1, 0],
       [0, 0]])
In [390]: x*x.T
Out[390]:
array([[1, 0],
       [0, 0]])
This is the elementwise product of (2,1) and (1,2) resulting in a (2,2), or outer product.
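If you do want to keep explicit (n,1) column vectors, a small sketch of how to recover the scalar product from them:
import numpy as np

x = np.array([[1], [0]])               # (2, 1) column vector

s = x.T @ x                            # (1, 2) @ (2, 1) -> (1, 1) matrix
s.item()                               # 1, extracted as a Python scalar

np.inner(x.ravel(), x.ravel())         # 1, back in 1-D land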

NumbaPro - Smartest way to sort a 2d array and then sum over entries of same key

In my program I have an array with the size of multiple million entries like this:
arr=[(1,0.5), (4,0.2), (321, 0.01), (2, 0.042), (1, 0.01), ...]
I could instead make two arrays with the same order (instead of an array of tuples) if that helps.
For sorting this array I know I can use radix sort so it has this structure:
arr_sorted = [(1,0.5), (1,0.01), (2,0.042), ...]
Now I want to sum over all the values from the array that have the key 1. Then all that have the key 2 etc. That should be written into a new array like this:
arr_summed = [(1, 0.51), (2, 0.042), ...]
Obviously this array would be much smaller, although still on the order of 100,000 entries. Now my question is: what's the best parallel approach to my problem in CUDA? I am using NumbaPro.
Edit for clarity
I would have two arrays instead of a list of tuples like this:
keys = [1, 2, 5, 2, 6, 4, 4, 65, 3215, 1, .....]
values = [0.1, 0.4, 0.123, 0.01, 0.23, 0.1, 0.1, 0.4 ...]
They are initially numpy arrays that get copied to the device.
What I want is to reduce them by key and, if possible, set missing key values (for example, if the key 3 doesn't appear anywhere) to zero.
So I would want it to become:
keys = [1, 2, 3, 4, 5, 6, 7, 8, ...]
values = [0.11, 0.41, 0, 0.2, ...] # <- Summed by key
I know how big the final array will be beforehand.
I don't know Numba, but in simple Python:
arr=[(1,0.5), (4,0.2), (321, 0.01), (2, 0.042), (1, 0.01), ...]
indexmax = max(k for k, _ in arr)  # the largest key determines the result size
res = [0.0] * (indexmax + 1)
for k, v in arr:
    res[k] += v
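Not a CUDA answer, but for comparison, the same reduction is vectorized on the host with np.bincount, which also gives missing keys a zero sum (a sketch using the two-array layout from the edit):
import numpy as np

keys = np.array([1, 2, 5, 2, 6, 4, 4, 65])
values = np.array([0.1, 0.4, 0.123, 0.01, 0.23, 0.1, 0.1, 0.4])

# Sum `values` into bins indexed by `keys`; keys that never occur sum to 0.0
summed = np.bincount(keys, weights=values, minlength=keys.max() + 1)
# summed[2] -> 0.41, summed[3] -> 0.0, summed[4] -> 0.2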

removing duplicates from list of arrays and concatenating associated values

I have two lists of arrays of data (cost and account) identified by the list 'number'. The arrays in the lists have different lengths, but each cost array has a corresponding account array of the same length.
I would like to remove the duplicates in the list number and concatenate the corresponding data in cost and account together for each duplicate. The ordering is important. Here's an example of the lists I have:
number = [4, 6, 8, 4, 8]
cost = [array([1,2,3], dtype = uint64), array([5,6,7,8], dtype = uint64), array([9,10,11], dtype= uint64), array([13,14,15], dtype = uint64), array([17,18], dtype = uint64)]
account = [array([.1,.2,.3], dtype = float32), array([.5,.6,.7,.8], dtype = float32), array([.5,.10,.11], dtype= float32), array([.13,.14,.15], dtype = float32), array([32,.18], dtype = float32)]
The desired result is to have:
number = [4,6,8]
cost = [[1,2,3,13,14,15],[5,6,7,8],[9,10,11,17,18]]
account = [[.1,.2,.3,.13,.14,.15],[.5,.6,.7,.8],[.5,.10,.11,32,.18]]
Is there an easy way to do this with indexing or dictionaries?
If the order of number is not important (e.g. [8, 4, 6]), you can do as follows:
number = [4, 6, 8, 4, 8]
cost = [[1,2,3],[5,6,7],[9,10,11],[13,14,15],[17,18,19]]
account = [[.1,.2,.3],[.5,.6,.7],[.9,.0,.1],[.3,.4,.5],[.7,.8,.9]]
duplicates = lambda lst, item: [i for i, x in enumerate(lst) if x == item]
indexes = dict((n, duplicates(number, n)) for n in set(number))
number = list(set(number))
cost = [sum([cost[num] for num in val], []) for val in indexes.values()]
account = [sum([account[num] for num in val], []) for val in indexes.values()]
Here indexes is a dictionary mapping each number to the list of positions where it occurs, built with the duplicate-finding lambda.
You can do:
# Find unique values in "number"
number, idx, inv = np.unique(number,return_index=True,return_inverse=True)
# concatenate "cost" based on unique values
cost = [np.asarray(cost)[np.where(inv==i)[0]].flatten().tolist() \
for i in idx ]
# concatenate "account" based on unique values
account = [np.asarray(account)[np.where(inv==i)[0]].flatten().tolist() \
for i in idx ]
# Check
In [248]: number
[4 6 8]
In [249]: cost
[[1, 2, 3, 13, 14, 15], [5, 6, 7], [9, 10, 11, 17, 18, 19]]
In [250]: account
[[0.1, 0.2, 0.3, 0.3, 0.4, 0.5], [0.5, 0.6, 0.7], [0.9, 0.0, 0.1, 0.7, 0.8, 0.9]]
np.asarray() and tolist() are unnecessary if your inputs are numpy arrays, so you might want to get rid of them. I just added them so that this also works for plain Python lists.
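And if you prefer the dictionary route from the question, a minimal sketch that preserves first-seen order and works for ragged numpy arrays (the helper name merge_by_number is illustrative):
import numpy as np

def merge_by_number(number, cost, account):
    # Group the cost/account arrays under their number, in first-seen order
    groups = {}
    for n, c, a in zip(number, cost, account):
        groups.setdefault(n, ([], []))
        groups[n][0].append(c)
        groups[n][1].append(a)
    nums = list(groups)  # dict preserves insertion order (Python 3.7+)
    costs = [np.concatenate(groups[n][0]) for n in nums]
    accounts = [np.concatenate(groups[n][1]) for n in nums]
    return nums, costs, accounts

number = [4, 6, 8, 4, 8]
# with the cost/account lists from the question:
# merge_by_number(number, cost, account) -> ([4, 6, 8], [concatenated costs], [concatenated accounts])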
