How to get similar elements of two numpy arrays with a tolerance

I would like to compare values from columns of two different numpy arrays A and B. More specifically, A contains values from a real experiment that I want to match with theoretical values that are given in the third column of B.
There are no perfect matches and therefore I have to use a tolerance, e.g. 0.01. For each value in A, I expect 0 to 20 matches in B with respect to the selected tolerance. As a result, I would like to get those lines in B that are within the tolerance to a value in A.
To be more specific, here is an example:
A = array([[ 2.83151742e+02, a0],
           [ 2.83155339e+02, a1],
           [ 3.29241719e+02, a2],
           [ 3.29246229e+02, a3]])
B = array([[ 0, 0, 3.29235519e+02, ...],
           [ 0, 0, 3.29240819e+02, ...],
           [ 0, 0, 3.29241919e+02, ...],
           [ 0, 0, 3.29242819e+02, ...]])
So here, all values of B would match A[2,0] and A[3,0] for a tolerance of 0.02.
My preferred result would look like this, with the matched value of A in C[:,0] and the difference between C[:,0] and C[:,2] in C[:,1]:
C = array([[ 3.29241719e+02, c0, 3.29235519e+02, ...],
           [ 3.29241719e+02, c1, 3.29240819e+02, ...],
           [ 3.29241719e+02, c2, 3.29241919e+02, ...],
           [ 3.29241719e+02, c3, 3.29242819e+02, ...],
           [ 3.29246229e+02, c4, 3.29235519e+02, ...],
           [ 3.29246229e+02, c5, 3.29240819e+02, ...],
           [ 3.29246229e+02, c6, 3.29241919e+02, ...],
           [ 3.29246229e+02, c7, 3.29242819e+02, ...]])
Typically, A has shape (500, 2) and B has shape (300000, 11). I can solve it with for-loops, yet it takes ages.
What would be the most efficient way for this comparison?

I'd imagine it would be something like
i = np.nonzero(np.isclose(A[:, 0, None], B[:, 2]))[-1]
(note A[:, 0, None], so that only A's first column is compared against B[:, 2]). np.isclose accepts relative and absolute tolerance parameters (rtol and atol), so you can pass atol=0.02 here.
The values in B close to the A values would then be B[i, 2]
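Building on that, here is a minimal sketch of how the full result C could be assembled (the atol=0.02 with rtol=0 and the column layout are assumptions taken from the question):
import numpy as np
# pairwise closeness: rows of the boolean matrix index into A, columns into B
close = np.isclose(A[:, 0, None], B[None, :, 2], rtol=0, atol=0.02)
ai, bi = np.nonzero(close)                 # matching (A-row, B-row) index pairs
C = np.column_stack([A[ai, 0],             # matched A value for C[:, 0]
                     A[ai, 0] - B[bi, 2],  # difference for C[:, 1]
                     B[bi]])               # the full matching B rows
For A of shape (500, 2) and B of shape (300000, 11), the intermediate boolean matrix has 500 x 300000 entries (~150 MB as bools), which is usually still manageable; if memory is tight, sorting B[:, 2] and using np.searchsorted per A value avoids materializing it.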


Numpy 2-D array boolean masking

I don't understand one example in this numpy tutorial.
a = np.arange(12).reshape(3,4)
b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])
Then why will a[b1,b2] return array([4, 10])? Shouldn't it return array([[4, 6], [8, 10]])?
Any detailed explanation is appreciated!
When you index an array with multiple arrays, it indexes with pairs of elements drawn from the indexing arrays.
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> b1
array([False, True, True], dtype=bool)
>>> b2
array([ True, False, True, False], dtype=bool)
>>> a[b1, b2]
array([ 4, 10])
Notice that this is equivalent to:
>>> a[(1, 2), (0, 2)]
array([ 4, 10])
which are the elements at a[1, 0] and a[2, 2]
>>> a[1, 0]
4
>>> a[2, 2]
10
Because of this pairwise behavior, you cannot in general index with arrays of different lengths (they have to be able to broadcast). So this example is sort of an accident, since both indexing arrays have exactly two True values; if one had three True values, for example, you'd get an error:
>>> b3 = np.array([True, True, True, False])
>>> a[b1, b3]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (2,) (3,)
So this is specifically letting you know that the indexing arrays must be able to be broadcast together (so that indices can be paired off in a sensible way; e.g. if one indexing array held just a single value, it would be repeated and paired with each value from the other indexing array).
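A small illustration of that broadcast-and-pair behavior, reusing a and b1 from above (the scalar 0 is repeated as the column index for each row selected by b1, giving a[1, 0] and a[2, 0]):
>>> a[b1, 0]
array([4, 8])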
To get the results you expect, you could index the result separately:
>>> a[b1][:, b2]
array([[ 4,  6],
       [ 8, 10]])
Otherwise, you could also turn your index arrays into a 2D mask with the same shape as a, but note that if you do that the result will be a flat 1-D array (since any number of elements could be pulled out, which of course might not form a rectangle):
>>> a[np.outer(b1, b2)]
array([ 4, 6, 8, 10])
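Alternatively, np.ix_ builds an open mesh from the two boolean masks and gives the 2-D result directly:
>>> a[np.ix_(b1, b2)]
array([[ 4,  6],
       [ 8, 10]])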
The indices of True for the first array are
>>> i = np.where(b1)[0]
>>> i
array([1, 2])
For the second array they are
>>> j = np.where(b2)[0]
>>> j
array([0, 2])
Using these index arrays together,
>>> a[i, j]
array([ 4, 10])
Another way to apply a general boolean 2D mask on a 2D numpy array is the following:
Use matrix element-wise multiplication:
import numpy as np
n = 100
mask = np.identity(n)
data = np.random.rand(n,n)
data_masked = data * mask
In this random example, you are keeping only the elements on the diagonal, but the mask could be any n-by-n array of zeros and ones.
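Note that, unlike boolean indexing (which returns a flat array of the selected values), multiplication keeps the array's shape and zeros out everything else. np.where gives the same keep-the-shape behavior with an explicit fill value; a one-line sketch:
data_masked = np.where(mask.astype(bool), data, 0.0)  # same shape; unselected entries become 0.0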

Selective reshaping of 3d array to 2d array

I'm working with a 3d array of vectors, and having trouble reshaping properly.
My dimensions correspond to quantities as follows:
0 = vector (3)
1 = point (4)
2 = polyline (2)
So this can be interpreted as 2 polylines that each contain 4 points, and each point has a vector. I want to reshape to a 2d matrix that is (3, 8).
The original array is:
poly_array = array([[[-0.707, 0.0],
                     [-0.371, 0.0],
                     [ 0.371, 0.0],
                     [ 0.707, 0.0]],
                    [[ 0.0, -0.707],
                     [ 0.0,  0.0],
                     [ 0.0,  0.707],
                     [ 0.0,  0.0]],
                    [[ 0.707, 0.707],
                     [ 0.928, 1.0],
                     [ 0.928, 0.707],
                     [ 0.707, 0.0]]])
So if I'm looking at ordered points along the first polyline, I would run:
for i in range(4):
    print(poly_array[:, i, 0])
or for ordered points along the second polyline:
for i in range(4):
    print(poly_array[:, i, 1])
I can reshape this way:
new_dim = poly_array.shape[1] * poly_array.shape[2]
new_array = poly_array.reshape(3, new_dim)
But this orders the vectors as taking one from each polyline (i.e., pt0-polyline0, pt0-polyline1, pt1-polyline0, pt1-polyline1, etc.)
In: print(new_array[:, 0])
Out: [-0.707  0.     0.707]
In: print(new_array[:, 1])
Out: [ 0.    -0.707  0.707]
But I want
In: print(new_array[:, 1])
Out: [-0.371  0.     0.928]
How can I reshape so that it loops through all the vectors corresponding to points (along axis 1) for a given polyline before the next polyline?
You would need some permuting of axes with np.swapaxes and a reshape -
poly_array.swapaxes(1,2).reshape(poly_array.shape[0],-1)
Sample run -
In [127]: poly_array
Out[127]:
array([[[-0.707,  0.   ],
        [-0.371,  0.   ],
        [ 0.371,  0.   ],
        [ 0.707,  0.   ]],
       [[ 0.   , -0.707],
        [ 0.   ,  0.   ],
        [ 0.   ,  0.707],
        [ 0.   ,  0.   ]],
       [[ 0.707,  0.707],
        [ 0.928,  1.   ],
        [ 0.928,  0.707],
        [ 0.707,  0.   ]]])
In [141]: out = poly_array.swapaxes(1,2).reshape(poly_array.shape[0],-1)
In [142]: out
Out[142]:
array([[-0.707, -0.371,  0.371,  0.707,  0.   ,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  0.   , -0.707,  0.   ,  0.707,  0.   ],
       [ 0.707,  0.928,  0.928,  0.707,  0.707,  1.   ,  0.707,  0.   ]])
In [143]: out[:,1]
Out[143]: array([-0.371, 0. , 0.928])
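For what it's worth, an equivalent spelling uses np.transpose, since swapping axes 1 and 2 is just the (0, 2, 1) permutation; picking one is a matter of taste, not behavior:
out = poly_array.transpose(0, 2, 1).reshape(poly_array.shape[0], -1)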

Efficient data sifting for unique values (Python)

I have a 2D Numpy array that consists of (X,Y,Z,A) values, where (X,Y,Z) are Cartesian coordinates in 3D space, and A is some value at that location. As an example..
__X__|__Y__|__Z__|__A__
  13 |   7 |  21 | 1.5
   9 |   2 |   7 | 0.5
  15 |   3 |   9 | 1.1
  13 |   7 |  21 | 0.9
  13 |   7 |  21 | 1.7
  15 |   3 |   9 | 1.1
Is there an efficient way to find all the unique combinations of (X,Y), and add their values? For example, the total for (13,7) would be (1.5+0.9+1.7), or 4.1.
A scipy.sparse matrix takes exactly this kind of information, but only for two index dimensions:
sparse.coo_matrix((data, (row, col)))
where row and col are index arrays like your X and Y. It sums duplicates.
The first step to doing that is a lexical sort of the indices. That puts points with matching coordinates next to each other.
The actual grouping and summing is done, I believe, in compiled code. Part of the difficulty of doing this fast in pure numpy terms is that each group has a variable number of elements: some will be unique, others might have 3 or more.
Python itertools has a groupby tool. Pandas also has grouping functions (a sketch appears at the end of this answer). I can also imagine using a defaultdict to group and sum values.
The ufunc reduceat might also work, though it's easier to use in 1d than 2 or 3.
If you are ignoring the Z, the sparse coo_matrix approach may be easiest.
In [1]: import numpy as np
In [2]: from scipy import sparse
In [3]: X=np.array([13,9,15,13,13,15])
In [4]: Y=np.array([7,2,3,7,7,3])
In [5]: A=np.array([1.5,0.5,1.1,0.9,1.7,1.1])
In [6]: M=sparse.coo_matrix((A,(X,Y)))
In [15]: M.sum_duplicates()
In [16]: M.data
Out[16]: array([ 0.5, 2.2, 4.1])
In [17]: M.row
Out[17]: array([ 9, 15, 13])
In [18]: M.col
Out[18]: array([2, 3, 7])
In [19]: M
Out[19]:
<16x8 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
Here's what I had in mind with lexsort
In [32]: Z=np.array([21,7,9,21,21,9])
In [33]: xyz=np.stack([X,Y,Z],1)
In [34]: idx=np.lexsort([X,Y,Z])
In [35]: idx
Out[35]: array([1, 2, 5, 0, 3, 4], dtype=int32)
In [36]: xyz[idx,:]
Out[36]:
array([[ 9,  2,  7],
       [15,  3,  9],
       [15,  3,  9],
       [13,  7, 21],
       [13,  7, 21],
       [13,  7, 21]])
In [37]: A[idx]
Out[37]: array([ 0.5, 1.1, 1.1, 1.5, 0.9, 1.7])
When sorted like this it becomes more evident that the Z coordinate is 'redundant', at least for this purpose.
Using reduceat to sum groups:
In [40]: np.add.reduceat(A[idx],[0,1,3])
Out[40]: array([ 0.5, 2.2, 4.1])
(for now I just eyeballed the [0,1,3] list)
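That boundary list can also be computed rather than eyeballed: after sorting, a new group starts wherever the (X, Y) pair changes. A sketch using the sorted rows from above:
In [41]: xy_sorted = xyz[idx, :2]   # Z is redundant for this grouping
In [42]: starts = np.flatnonzero(np.r_[True, (np.diff(xy_sorted, axis=0) != 0).any(axis=1)])
In [43]: starts
Out[43]: array([0, 1, 3])
In [44]: np.add.reduceat(A[idx], starts)
Out[44]: array([ 0.5,  2.2,  4.1])
And the pandas grouping mentioned earlier is essentially a one-liner (a sketch; the column names are made up for illustration):
In [45]: import pandas as pd
In [46]: df = pd.DataFrame({'X': X, 'Y': Y, 'A': A})
In [47]: df.groupby(['X', 'Y'], as_index=False)['A'].sum()
Out[47]:
    X  Y    A
0   9  2  0.5
1  13  7  4.1
2  15  3  2.2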
Approach #1
Get each row as a view, thus reducing each row to a single scalar, then use np.unique to tag each row with an integer ID in the range (0...n-1), with n as the number of unique rows, and finally use np.bincount to sum the last column per ID.
Here's the implementation -
def get_row_view(a):
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    a = np.ascontiguousarray(a)
    return a.reshape(a.shape[0], -1).view(void_dt).ravel()

def groupby_cols_view(x):
    a = x[:, :2].astype(int)
    a1D = get_row_view(a)
    _, indx, IDs = np.unique(a1D, return_index=1, return_inverse=1)
    return np.c_[x[indx, :2], np.bincount(IDs, x[:, -1])]
Approach #2
Same as Approach #1, but instead of working with views we generate an equivalent linear index for each row, again reducing each row to a scalar (this assumes the first two columns hold integer-like values). The rest of the workflow is the same as in the first approach.
The implementation -
def groupby_cols_linearindex(x):
    a = x[:, :2].astype(int)
    # scale by the range of column 0 so each (col0, col1) pair maps to a unique scalar
    a1D = a[:, 0] + a[:, 1] * (a[:, 0].max() - a[:, 0].min() + 1)
    _, indx, IDs = np.unique(a1D, return_index=1, return_inverse=1)
    return np.c_[x[indx, :2], np.bincount(IDs, x[:, -1])]
Sample runs
In [80]: data
Out[80]:
array([[ 2.        ,  5.        ,  1.        ,  0.40756048],
       [ 3.        ,  4.        ,  6.        ,  0.78945661],
       [ 1.        ,  3.        ,  0.        ,  0.03943097],
       [ 2.        ,  5.        ,  7.        ,  0.43663582],
       [ 4.        ,  5.        ,  0.        ,  0.14919507],
       [ 1.        ,  3.        ,  3.        ,  0.03680583],
       [ 1.        ,  4.        ,  8.        ,  0.36504428],
       [ 3.        ,  4.        ,  2.        ,  0.8598825 ]])
In [81]: groupby_cols_view(data)
Out[81]:
array([[ 1.        ,  3.        ,  0.0762368 ],
       [ 1.        ,  4.        ,  0.36504428],
       [ 2.        ,  5.        ,  0.8441963 ],
       [ 3.        ,  4.        ,  1.64933911],
       [ 4.        ,  5.        ,  0.14919507]])
In [82]: groupby_cols_linearindex(data)
Out[82]:
array([[ 1.        ,  3.        ,  0.0762368 ],
       [ 1.        ,  4.        ,  0.36504428],
       [ 3.        ,  4.        ,  1.64933911],
       [ 2.        ,  5.        ,  0.8441963 ],
       [ 4.        ,  5.        ,  0.14919507]])

strange behavior when updating matrix

import numpy as np
X_mini = np.array([[   4, 2104,   1],
                   [   1, 1600,   3],
                   [   3, 2400, 100]])

def feature_normalization(X):
    row_length = len(X[0:1][0])
    for i in range(0, row_length):
        if not X[:, i].std() == 0:
            temp = (X[:, i] - X[:, i].mean()) / X[:, i].std()
            print(temp)
            X[:, i] = temp
feature_normalization(X_mini)
print(X_mini)
outputs:
[ 1.06904497 -1.33630621 0.26726124]
[ 0.209937 -1.31614348 1.10620649]
[-0.72863911 -0.68535362 1.41399274]
[[ 1  0  0]
 [-1 -1  0]
 [ 0  1  1]]
My question is: why doesn't X_mini (after applying feature_normalization) correspond to what is being printed out?
Your array holds values of integer type (probably int64).
When fractions are inserted into it, they're converted to int.
You can explicitly specify the type of an array you create:
X_mini = np.array([[   4.0, 2104.0,   1.0],
                   [   1.0, 1600.0,   3.0],
                   [   3.0, 2400.0, 100.0]], dtype=np.float64)
You can also convert an array to another type using numpy.ndarray.astype (docs).
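A minimal sketch of that route (astype returns a converted copy, so the fractions are then stored correctly):
X_float = X_mini.astype(np.float64)  # copy with a float dtype
feature_normalization(X_float)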

Sort (symmetric) numpy 2D array by function (norm)

How to sort a matrix by the norm of its rows efficiently (using numpy.ndarrays)?
I want to sort the matrix A:
A = np.array([[10,  1,  6,  3],
              [ 1, 12,  2,  4],
              [ 6,  2, 14,  5],
              [ 3,  4,  5,  9]])
by the norm of its rows.
What I do now is to create a list of the norms, get the index list that sorts it, and sort the matrix based on that index list. Is this the way to go?
indexlist = np.argsort(np.apply_along_axis(np.linalg.norm, 0, A))
#indexlist = array([3, 0, 1, 2])
Then my sorted matrix is
sortedA = A[indexlist]
and the symmetrically sorted matrix would then be
sym_sortedA = A[indexlist][:,indexlist]
Yes, this is the most common way to do that. A bit shorter would be to use
indexlist = np.argsort(np.linalg.norm(A,axis=1))
You need to use axis=1 if you want to sort by rows, but since the matrix is symmetric that doesn't matter.
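Putting it together, a compact sketch for the symmetric case (np.ix_ forms the row/column mesh in one step):
indexlist = np.argsort(np.linalg.norm(A, axis=1))
sym_sortedA = A[np.ix_(indexlist, indexlist)]  # same as A[indexlist][:, indexlist]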
