Find common elements in subarrays of arrays - arrays

I have two NumPy arrays, arr1 of shape (~140000, 3) and arr2 of shape (~450000, 10). In both arrays, the first 3 elements of each row are coordinates (z,y,x). I want to find the rows of arr2 whose coordinates also appear in arr1 (the coordinates in arr1 can be considered a subset of those in arr2).
For example:
arr1 = [[1,2,3],[1,2,5],[1,7,8],[5,6,7]]
arr2 = [[1,2,3,7,66,4,3,44,8,9],[1,3,9,6,7,8,3,4,5,2],[1,5,8,68,7,8,13,4,53,2],[5,6,7,6,67,8,63,4,5,20], ...]
I want to find common coordinates (same first 3 elements):
list_arr = [[1,2,3,7,66,4,3,44,8,9], [5,6,7,6,67,8,63,4,5,20], ...]
At the moment I'm doing this double loop, which is extremely slow:
list_arr = []
for i in arr1:
    for j in arr2:
        if i[0] == j[0] and i[1] == j[1] and i[2] == j[2]:
            list_arr.append(j)
I also tried to create (after the 1st loop) a subarray of arr2, filtered on the value of i[0] (arr2_filt = [el for el in arr2 if el[0]==i[0]]). This speeds up the operation a bit, but it is still really slow.
Can you help me with this?

Approach #1
Here's a vectorized one with views -
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

a,b = view1D(arr1,arr2[:,:3])
out = arr2[np.in1d(b,a)]
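As a quick check, here is a minimal sketch that runs the view-based approach on the sample data from the question. The name view1d and the use of np.isin (in place of np.in1d, which recent NumPy releases deprecate) are choices made for this sketch. It assumes both inputs share the same dtype and column count.

```python
import numpy as np

def view1d(a, b):
    """View each row of two 2-D arrays (same dtype, same column count) as one opaque element."""
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

# Sample data from the question
arr1 = np.array([[1, 2, 3], [1, 2, 5], [1, 7, 8], [5, 6, 7]])
arr2 = np.array([
    [1, 2, 3, 7, 66, 4, 3, 44, 8, 9],
    [1, 3, 9, 6, 7, 8, 3, 4, 5, 2],
    [1, 5, 8, 68, 7, 8, 13, 4, 53, 2],
    [5, 6, 7, 6, 67, 8, 63, 4, 5, 20],
])

a, b = view1d(arr1, arr2[:, :3])
mask = np.isin(b, a)   # True where the coordinate triple of arr2 occurs in arr1
out = arr2[mask]       # rows 0 and 3 of arr2
print(out)
```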
Approach #2
Another with dimensionality-reduction for ints -
d = np.maximum(arr2[:,:3].max(0),arr1.max(0)) + 1 # per-column extent; assumes non-negative ints
s = np.r_[1,d[:-1].cumprod()]
a,b = arr1.dot(s),arr2[:,:3].dot(s)
out = arr2[np.in1d(b,a)]
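To make the dimensionality reduction concrete, here is a small worked sketch on the question's sample data. It assumes non-negative integer coordinates and takes the per-column extent as the largest value plus one, so that two different triples cannot collapse to the same key. The variable names are illustrative only.

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [1, 2, 5], [1, 7, 8], [5, 6, 7]])
arr2 = np.array([
    [1, 2, 3, 7, 66, 4, 3, 44, 8, 9],
    [1, 3, 9, 6, 7, 8, 3, 4, 5, 2],
    [1, 5, 8, 68, 7, 8, 13, 4, 53, 2],
    [5, 6, 7, 6, 67, 8, 63, 4, 5, 20],
])
coords2 = arr2[:, :3]

# Per-column extent: largest value seen in either array, plus one.
extent = np.maximum(coords2.max(axis=0), arr1.max(axis=0)) + 1
# Mixed-radix weights: 1, extent[0], extent[0] * extent[1].
weights = np.r_[1, extent[:-1].cumprod()]

key1 = arr1.dot(weights)      # one integer key per row of arr1
key2 = coords2.dot(weights)   # one integer key per row of arr2
out = arr2[np.isin(key2, key1)]
print(weights, key1, key2)
```

For very large coordinate ranges the keys can overflow the integer dtype, in which case the view-based approach is the safer option.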
Improvement #1
We could use np.searchsorted to replace np.in1d for both of the approaches listed earlier -
unq_a = np.unique(a)
idx = np.searchsorted(unq_a,b)
idx[idx==len(unq_a)] = 0
out = arr2[unq_a[idx] == b]
Improvement #2
For the last improvement on using np.searchsorted that also uses np.unique, we could use argsort instead -
sidx = a.argsort()
idx = np.searchsorted(a,b,sorter=sidx)
idx[idx==len(a)] = 0
out = arr2[a[sidx[idx]]==b]
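The argsort variant can be wrapped in a small helper and checked against np.isin on random 1-D keys. This is only a sketch: the helper name isin_sorted is made up here, and it assumes a is non-empty.

```python
import numpy as np

def isin_sorted(b, a):
    """Boolean mask telling which elements of 1-D array b occur in 1-D array a."""
    sidx = a.argsort()                          # order that sorts a
    idx = np.searchsorted(a, b, sorter=sidx)    # insertion points in the sorted order
    idx[idx == len(a)] = 0                      # clamp positions past the end
    return a[sidx[idx]] == b                    # a hit only if the value found equals b

rng = np.random.default_rng(0)
a = rng.integers(0, 50, size=30)
b = rng.integers(0, 60, size=100)
mask = isin_sorted(b, a)
print(np.array_equal(mask, np.isin(b, a)))
```

Clamping to 0 is safe because an element of b that lands past the end is larger than every element of a, so it cannot equal the smallest one.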

You can do it with the help of set
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2 = np.array([[7,8,9,11,14,34],[23,12,11,10,12,13],[1,2,3,4,5,6]])
# create array from arr2 with only first 3 columns
temp = [i[:3] for i in arr2]
aset = set([tuple(x) for x in arr])
bset = set([tuple(x) for x in temp])
np.array([x for x in aset & bset])
Output
array([[7, 8, 9],
       [1, 2, 3]])
Edit
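Use a list comprehension to keep the full rows (looking each coordinate tuple up in aset, since `in` on a NumPy array compares element-wise rather than row-wise):
l = [i.tolist() for i in arr2 if tuple(i[:3]) in aset]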
print(l)
Output:
[[7, 8, 9, 11, 14, 34], [1, 2, 3, 4, 5, 6]]
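A note on membership tests with NumPy arrays: `x in arr` is evaluated as `(arr == x).any()`, so it is true as soon as any single element matches, not only when a whole row matches. The sketch below contrasts that with a lookup in a set of tuples. The probe row is made up for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
probe = np.array([1, 5, 9])   # hypothetical triple, NOT a row of arr

# `in` broadcasts the comparison and reduces with any():
# probe[0] == arr[0, 0], so this is True even though no row equals probe.
elementwise_hit = probe in arr

# A set of tuples compares whole rows.
row_set = {tuple(row) for row in arr.tolist()}
row_hit = tuple(probe.tolist()) in row_set

print(elementwise_hit, row_hit)  # True False
```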

For integers Divakar already gave an excellent answer. If you want to compare floats you have to consider e.g. the following:
>>> 1.+1e-15==1.
False
>>> 1.+1e-16==1.
True
If this behaviour could lead to problems in your code, I would recommend performing a nearest-neighbour search and checking that the distances are within a specified threshold.
import numpy as np
from scipy import spatial
def get_indices_of_nearest_neighbours(arr1,arr2):
    tree=spatial.cKDTree(arr2[:,0:3])
    dist,ind=tree.query(arr1, k=1)
    # You can check here if the distances are small enough and otherwise raise an error
    return ind
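One way to add the threshold check is the distance_upper_bound argument of cKDTree.query, which reports an infinite distance when no neighbour lies within the bound. The sketch below is a variation rather than the function above: it builds the tree on arr1 and queries it with the coordinates of arr2, so the result is directly a mask over the rows of arr2. The function name, the tolerance value and the sample data are assumptions made for illustration.

```python
import numpy as np
from scipy import spatial

def match_rows_with_tolerance(arr1, arr2, tol=1e-9):
    """Return the rows of arr2 whose first three columns lie within tol of some row of arr1."""
    tree = spatial.cKDTree(arr1)
    # dist is inf wherever no point of arr1 lies within tol
    dist, _ = tree.query(arr2[:, :3], k=1, distance_upper_bound=tol)
    return arr2[np.isfinite(dist)]

# 0.1 + 0.2 differs from 0.3 by a rounding error, so an exact comparison would miss it.
arr1 = np.array([[0.1 + 0.2, 1.0, 2.0], [5.0, 6.0, 7.0]])
arr2 = np.array([
    [0.3, 1.0, 2.0, 42.0],
    [9.0, 9.0, 9.0, 43.0],
    [5.0, 6.0, 7.0, 44.0],
])
out = match_rows_with_tolerance(arr1, arr2)
print(out)
```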

Related

numpy array difference to the largest value in another large array which less than the original array

numpy experts,
I'm using numpy.
I want to compare two arrays: for each element of B, find the largest value in A that is less than or equal to it, and calculate the difference between the two.
For example,
A = np.array([3, 5, 7, 12, 13, 18])
B = np.array([4, 7, 17, 20])
I want [1, 0, 4, 2] (4-3, 7-7, 17-13, 20-18) , in this case.
The problem is that the size of the A and B arrays is so large that it would take a very long time to do this by simple means. I can try to divide them to some size, but I wonder if there is a simple numpy function to solve this problem.
Or can I use numba?
For your information, this is my current, very naive code:
delta = np.zeros_like(B)
for i in range(len(B)):
    index_A = (A <= B[i]).argmin() - 1
    delta[i] = B[i] - A[index_A]
I agree with @tarlen555 that the problem is mostly related to the for-loop. I guess this one is already much faster:
diff = B-A[:,np.newaxis]
diff[diff<0] = max(A.max(), B.max())
diff.min(axis=0)
In the second line, I replace all negative entries with a large sentinel value so that min ignores them. Since your numbers are integers, np.inf doesn't work, although something like that would be more elegant.
EDIT:
Another way:
from scipy.spatial import cKDTree
tree = cKDTree(A.reshape(-1, 1))
k = 2
large_value = max(A.max(), B.max())
while True:
    indices = tree.query(B.reshape(-1, 1), k=k)[1]
    diff = B[:,np.newaxis]-A[indices]
    if np.all(diff.max(axis=-1)>=0):
        break
    k += 1
diff[diff<0] = large_value
diff.min(axis=1)
This solution could be more memory-efficient but frankly I'm not sure how much more.
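If A is sorted in ascending order, as it is in the example, a binary search avoids building the len(A) x len(B) difference matrix altogether. This is a different technique from the two above, sketched here with np.searchsorted. It assumes every element of B has at least one element of A that is less than or equal to it.

```python
import numpy as np

A = np.array([3, 5, 7, 12, 13, 18])   # assumed sorted ascending
B = np.array([4, 7, 17, 20])

# side="right" returns the position just after the last element <= B[i],
# so subtracting 1 gives the index of the largest A that is <= B[i].
idx = np.searchsorted(A, B, side="right") - 1
if (idx < 0).any():
    raise ValueError("some element of B is smaller than every element of A")

delta = B - A[idx]
print(delta)  # [1 0 4 2]
```

If A is not sorted, sorting it first with np.sort(A) costs O(n log n) once, which is still far cheaper than the pairwise comparison for large inputs.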

How do I combine the coordinate pairs of an array into a single index?

I have an array
A = [3, 4; 5, 6; 4, 1];
Is there a way I could convert all coordinate pairs of the array into linear indices such that:
A = [1, 2, 3]'
whereby (3,4), (5,6), and (4,1) are represented by 1, 2, and 3, respectively.
Many thanks!
The reason I need this is that I want to loop through array A and use the two values of each coordinate pair, (3,4), (5,6), and (4,1), at the same time. I will need to feed each of these pairs into a function to do a further computation. See the pseudocode below:
for ii = 1: length(A);
    [x, y] = function_obtain_coord_pairs(A);
    B = function_obtain_fit(x, y, I);
end
whereby, at ii = 1, x=3 and y=4. The next iteration takes the pair x=5, y=6, etc.
Basically what will happen is that my kx2 array will be converted to a kx1 array. Thanks for your help.
Adapting your code, what you want was suggested by @Ander in the comments...
Your code
for ii = 1:length(A);
    [x, y] = function_obtain_coord_pairs(A);
    B = function_obtain_fit(x, y, I);
end
Adapted code
for ii = 1:size(A,1);
    x = A(ii, 1);
    y = A(ii, 2);
    B = function_obtain_fit(x, y, I); % is I here supposed to be ii? I not defined...
end
Your unfamiliarity with indexing makes me think your function_obtain_fit function could probably be vectorised to accept the entire matrix A, but that's a matter for another day!
For instance, you really don't need to define x or y at all...
Better code
for ii = 1:size(A,1);
    B = function_obtain_fit(A(ii, 1), A(ii, 2), I);
end
Here is a corrected version for your code:
A = [3, 4; 5, 6; 4, 1];
for k = A.'
    B = function_obtain_fit(k(1),k(2),I)
end
Iterating directly over A iterates over its columns. Because you want to iterate over the rows, we need to take the transpose, A.'. So if we just display k:
for k = A.'
    k
end
the output is:
k =
     3
     4
k =
     5
     6
k =
     4
     1

Using memoization for storing values in ruby array

For a short array the following code works well. It's supposed to return the first pair of array elements whose sum is equal to a given integer. However, if the array has upwards of 10 million elements, the request times out, because (I think) it is storing a huge number of values in the variable I create in the first line. I know I have to use memoization (||=) but have no idea how to use it.
array1 = [1,2,3,4,5,6,7]
number = 3
array2 = [1,2,3.....n] # millions of elements
combos = array1.combination(2).to_a
(combos.select { |x,y| x + y == number }).sort.first
I need to gather all possible pairs in order to sort them, so I'm using select to go through the entire list rather than stopping at the first pair that returns true.
This is one of the possible solutions.
def sum_pairs(ints, s)
  seen = {}
  for i in ints do
    return [s-i, i] if seen[s-i]
    seen[i] = true
  end
  nil
end
def find_smallest(arr, nbr)
  first, *rest = arr.sort
  until rest.empty?
    # find-any mode: the block returns 0 on a match, positive/negative to steer the search
    matching = rest.bsearch { |n| (nbr - first) <=> n }
    return [first, matching] unless matching.nil?
    first, *rest = rest
  end
  nil
end
arr = [12, 7, 4, 5, 14, 9]
find_smallest(arr, 19) #=> [5, 14]
find_smallest(arr, 20) #=> nil
I've used the method Array#bsearch (rather than Enumerable#find) to speed up the search for an element equal to nbr - first (O(log rest.size) vs. O(rest.size)).

indexing rows in matrix using matlab

Suppose I have an empty m-by-n-by-p dimensional cell called "cellPoints", and I also have a D-by-3 dimensional array called "cellIdx" where each row i contains the subscripts in "cellPoints". Now I want to compute "cellPoints" so that cellPoints{x, y, z} contains an array of row numbers in "cellIdx".
A naive implementation could be
for i = 1:size(cellIdx, 1)
    cellPoints{cellIdx(i, 1), cellIdx(i, 2), cellIdx(i, 3)} = ...
        [cellPoints{cellIdx(i, 1), cellIdx(i, 2), cellIdx(i, 3)};i];
end
As an example, suppose
cellPoints = cell(10, 10, 10);% user defined, cannot change
cellIdx = [1, 3, 2;
           3, 2, 1;
           1, 3, 2;
           1, 4, 2]
Then
cellPoints{1, 3, 2} = [1;3];
cellPoints{3, 2, 1} = [2];
cellPoints{1, 4, 2} = [4];
and other indices of cellPoints should be empty
Since cellIdx is a large matrix and this is clearly inefficient, are there any other better implementations?
I've tried using unique(cellIdx, 'rows') to find unique rows in cellIdx, and then writing a for-loop to compute cellPoints, but it's even slower than above.
See if this is faster:
cellPoints = cell(10,10,10); %// initialize to proper size
[~, jj, kk] = unique(cellIdx, 'rows', 'stable');
sz = size(cellPoints);
sz = [1 sz(1:end-1)];
csz = cumprod(sz).'; %'// will be used to build linear index
ind = 1+(cellIdx(jj,:)-1)*csz; %// linear index to fill cellPoints
cellPoints(ind) = accumarray(kk, 1:numel(kk), [], @(x) {sort(x)});
Or remove sort from the last line if order within each cell is not important.

'Array of arrays' in matlab?

Hey, having a wee bit of trouble. Trying to assign a variable length 1d array to different values of an array, e.g.
a(1) = [1, 0.13,0.52,0.3];
a(2) = [1, 0, .268];
However, I get the error:
??? In an assignment A(I) = B, the number of elements in B and
I must be the same.
Error in ==> lab2 at 15
a(1) = [1, 0.13,0.52,0.3];
I presume this means that it's expecting a scalar value instead of an array. Does anybody know how to assign the array to this value?
I'd rather not define it directly as a 2d array, as the rows are solutions to different problems computed in a loop
Edit: Got it!
a(1,1:4) = [1, 0.13,0.52,0.3];
a(2,1:3) = [1, 0, .268];
What you probably wanted to write was
a(1,:) = [1, 0.13,0.52,0.3];
a(2,:) = [1, 0, .268];
i.e. the first row is [1, 0.13,0.52,0.3] and the second row is [1, 0, .268]. This is not possible, because what would be the value of a(2,4)?
There are two ways to fix the problem.
(1) Use cell arrays
a{1} = [1, 0.13,0.52,0.3];
a{2} = [1, 0, .268];
(2) If you know the maximum possible number of columns your solutions will have, you can preallocate your array and write in the results like so (if you don't preallocate, you'll get zero-padding; you also risk slowing down your loop a lot if there are many iterations, because the array will have to be recreated at every iteration).
a = NaN(nIterations,maxNumCols); %# this fills the array with not-a-numbers
tmp = [1, 0.13,0.52,0.3];
a(1,1:length(tmp)) = tmp;
tmp = [1, 0, .268];
a(2,1:length(tmp)) = tmp;
