numpy experts,
I'm using numpy.
I want to compare two arrays: for each value of the second array, find the largest value in the first array that is smaller than or equal to it, and calculate the difference between them.
For example,
A = np.array([3, 5, 7, 12, 13, 18])
B = np.array([4, 7, 17, 20])
In this case I want [1, 0, 4, 2] (4-3, 7-7, 17-13, 20-18).
The problem is that the A and B arrays are so large that doing this by simple means would take a very long time. I could try splitting them into chunks of some size, but I wonder whether there is a simple numpy function that solves this problem.
Or can I use numba?
For reference, this is my current, very naive code:
delta = np.zeros_like(B)
for i in range(len(B)):
    index_A = (A <= B[i]).argmin() - 1
    delta[i] = B[i] - A[index_A]
I agree with @tarlen555 that the problem is mostly due to the for-loop. I guess this one is already much faster:
diff = B-A[:,np.newaxis]
diff[diff<0] = max(A.max(), B.max())
diff.min(axis=0)
In the second line, I fill all negative entries with something ridiculously large. Since your numbers are integers, np.inf doesn't work, but something along those lines would be more elegant.
EDIT:
Another way:
from scipy.spatial import cKDTree
tree = cKDTree(A.reshape(-1, 1))
k = 2
large_value = max(A.max(), B.max())
while True:
    indices = tree.query(B.reshape(-1, 1), k=k)[1]
    diff = B[:, np.newaxis] - A[indices]
    if np.all(diff.max(axis=-1) >= 0):
        break
    k += 1
diff[diff < 0] = large_value
diff.min(axis=1)
This solution could be more memory-efficient but frankly I'm not sure how much more.
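As an aside: since A in the example is already sorted, an np.searchsorted lookup would avoid building the full len(A) x len(B) difference matrix altogether. A minimal sketch, assuming A is kept sorted and every value in B has at least one A value less than or equal to it:
import numpy as np
A = np.array([3, 5, 7, 12, 13, 18])
B = np.array([4, 7, 17, 20])
# Index of the last A value that is <= each B value
idx = np.searchsorted(A, B, side='right') - 1
delta = B - A[idx]  # array([1, 0, 4, 2])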
I have a strange feeling this is a very easy problem to solve but I'm not finding a good way of doing this without using brute force or dynamic programming. Here it goes:
Given N arrays of ordered, monotonic values, find one position per array, i1, i2 ... in, that minimises the pairwise difference between the values at those indexes across all arrays. In other words, find the positions in all arrays whose values are closest to each other. Multiple solutions may exist, and the arrays may or may not be equally sized.
If A denotes the list of all arrays, the pairwise difference for a choice of indexes is the sum of absolute differences between the chosen values over every pair of distinct arrays, i.e. the sum over all j < k of |A_j[i_j] - A_k[i_k]|.
An example, 3 arrays a, b and c:
a = [20 29 30 32 33]
b = [28 29 30 32 33]
c = [10 12 28 31 32 33]
The best alignments for these arrays would be a[3] b[3] c[4] or a[4] b[4] c[5], because (32,32,32) and (33,33,33) are sets of equal values and therefore have minimum pairwise difference between each other. (Assuming array indexes start at 0.)
This is a common problem in bioinformatics that is usually solved with dynamic programming, but since the sequences are ordered, I think there should be some way of exploiting that order. I first thought about doing this pairwise, but that does not guarantee the global optimum, because the best local answer might not be the best global answer.
This is meant to be language agnostic, but I don't really mind an answer for a specific language, as long as there is no loss of generality. I know Dynamic Programming is an option here, but I have a feeling there's an easier way to do this?
The tricky thing is traversing the arrays so that at some point you're guaranteed to be considering the set of indices that realizes the pairwise min. Using a min heap on the values doesn't work. Counterexample with 4 arrays: [0,5], [1,2], [2], [2]. We start with d(0,1,2,2) = 7 and the optimum is d(0,2,2,2) = 6, but the min heap moves us from 7 to d(5,1,2,2) = 12, then d(5,2,2,2) = 9.
I believe (but haven't proved) that if we always increment the index that improves the pairwise distance the most (or degrades it the least), we're guaranteed to visit every local min and hence the global min.
Assuming n total elements across k arrays:
Simple approach: we repeatedly compute the pairwise-distance deltas (the delta with respect to incrementing each index) and increment the best one; any time doing so switches us from improvement to degradation (i.e. a local minimum), we calculate the pairwise distance. All of this is O(k^2) per increment, for a total running time of O((n-k) * k^2).
With O(k^2) storage, we could keep a matrix where entry (i,j) stores the pairwise-distance delta achieved by incrementing the index of array i with respect to array j. We also store the column sums. Then, on incrementing an index, we can update the appropriate row, column and column sums in O(k). This gives us a running time of O((n-k) * k).
To just complete Dave's answer, here is the pseudocode of the delta algorithm:
initialise index_table to 0's, where entry i holds the current index into the ith array
initialise delta_table with the corresponding cost of incrementing the index of the ith array while keeping the other indexes at their current values
cur_cost <- cost of the current index_table
best_cost <- cur_cost
best_solutions <- list containing the current index_table
while (can_at_least_one_index_increase)
    i <- index whose delta is lowest
    increment the i-th entry of index_table
    if cost(index_table) < cur_cost
        cur_cost = cost(index_table)
        best_solutions = {} U {index_table}
    else if cost(index_table) = cur_cost
        best_solutions = best_solutions U {index_table}
    update delta_table
Important note: during an iteration, some index_table entries might already have reached the maximum value for their array. When updating the delta_table, those entries must never be picked, otherwise the result is an array-out-of-bounds error, a segmentation fault or undefined behaviour. A neat trick is simply to check which indexes are already at their maximum and assign them a sufficiently large value, so they are never picked. Once no index can increase any more, the loop ends.
Here's an implementation in Python:
import math

def align_ordered_sequences(arrays: list):
    def get_cost(index_table):
        n = len(arrays)
        if n == 1:
            return 0
        total = 0
        for i in range(0, n - 1):
            for j in range(i + 1, n):
                v1 = arrays[i][index_table[i]]
                v2 = arrays[j][index_table[j]]
                total += math.sqrt((v1 - v2) ** 2)  # i.e. abs(v1 - v2)
        return total

    def compute_delta_table(index_table):
        # Initialise the delta table: we increment each index in turn, call
        # the cost method and then revert the change; this avoids having to
        # create copies, which would decrease performance unnecessarily
        delta_table = []
        for i in range(n):
            if index_table[i] + 1 >= len(arrays[i]):
                # Implementation detail: if the index is outside the bounds of
                # array i, choose a "large enough" number
                delta_table.append(999999999999999)
            else:
                index_table[i] = index_table[i] + 1
                delta_table.append(get_cost(index_table))
                index_table[i] = index_table[i] - 1
        return delta_table

    def can_at_least_one_index_increase(index_table):
        answer = False
        for i in range(len(arrays)):
            if index_table[i] < len(arrays[i]) - 1:
                answer = True
        return answer

    n = len(arrays)
    index_table = [0] * n
    delta_table = compute_delta_table(index_table)
    best_solutions = [index_table.copy()]
    cur_cost = get_cost(index_table)
    best_cost = cur_cost
    while can_at_least_one_index_increase(index_table):
        i = delta_table.index(min(delta_table))
        index_table[i] = index_table[i] + 1
        new_cost = get_cost(index_table)
        # A new best solution was found
        if new_cost < cur_cost:
            cur_cost = new_cost
            best_solutions = [index_table.copy()]
        # A new solution with the same cost was found
        elif new_cost == cur_cost:
            best_solutions.append(index_table.copy())
        # Update the delta table
        delta_table = compute_delta_table(index_table)
    return best_solutions
And here are some examples:
>>> print(align_ordered_sequences([[0,5], [1,2], [2], [2]]))
[[0, 1, 0, 0]]
>>> print(align_ordered_sequences([[3, 5, 8, 29, 40, 50], [1, 4, 14, 17, 29, 50]]))
[[3, 4], [5, 5]]
Note 2: this outputs indexes, not the actual values from each array.
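If the actual values are wanted rather than the indexes, here is a small sketch of how the returned index tables could be mapped back to values:
arrays = [[3, 5, 8, 29, 40, 50], [1, 4, 14, 17, 29, 50]]
solutions = align_ordered_sequences(arrays)  # [[3, 4], [5, 5]]
values = [[arr[i] for arr, i in zip(arrays, sol)] for sol in solutions]
# values -> [[29, 29], [50, 50]]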
I have two numpy arrays with shapes arr1.shape = (~140000, 3) and arr2.shape = (~450000, 10). The first 3 elements of each row, in both arrays, are coordinates (z, y, x). I want to find the rows of arr2 that have the same coordinates as rows of arr1 (which can be considered a subgroup of arr2).
for example:
arr1 = [[1,2,3],[1,2,5],[1,7,8],[5,6,7]]
arr2 = [[1,2,3,7,66,4,3,44,8,9],[1,3,9,6,7,8,3,4,5,2],[1,5,8,68,7,8,13,4,53,2],[5,6,7,6,67,8,63,4,5,20], ...]
I want to find common coordinates (same first 3 elements):
list_arr = [[1,2,3,7,66,4,3,44,8,9], [5,6,7,6,67,8,63,4,5,20], ...]
At the moment I'm doing this double loop, which is extremely slow:
list_arr = []
for i in arr1:
    for j in arr2:
        if i[0] == j[0] and i[1] == j[1] and i[2] == j[2]:
            list_arr.append(j)
I also tried to create (inside the first loop) a sub-array of arr2, filtered on the value of i[0] (arr2_filt = [el for el in arr2 if el[0] == i[0]]). This speeds up the operation a bit, but it still remains really slow.
Can you help me with this?
Approach #1
Here's a vectorized one with views -
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b):  # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()
a,b = view1D(arr1,arr2[:,:3])
out = arr2[np.in1d(b,a)]
Approach #2
Another with dimensionality-reduction for ints -
# Reduce each (z, y, x) triple to a single scalar key with a dot product,
# so the row matching becomes a 1D membership test
d = np.maximum(arr2[:,:3].max(0), arr1.max(0))
s = np.r_[1, d[:-1].cumprod()]
a, b = arr1.dot(s), arr2[:,:3].dot(s)
out = arr2[np.in1d(b, a)]
Improvement #1
We could use np.searchsorted to replace np.in1d for both of the approaches listed earlier -
unq_a = np.unique(a)
idx = np.searchsorted(unq_a, b)
idx[idx == len(unq_a)] = 0  # guard against out-of-range indices for b values beyond a's range
out = arr2[unq_a[idx] == b]
Improvement #2
For the last improvement on using np.searchsorted that also uses np.unique, we could use argsort instead -
sidx = a.argsort()
idx = np.searchsorted(a, b, sorter=sidx)
idx[idx == len(a)] = 0  # guard against out-of-range indices
out = arr2[a[sidx[idx]] == b]
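As a quick check on the arrays from the question (a sketch using only the four rows shown there, with the trailing "..." rows omitted), Approach #1 can be exercised like this:
import numpy as np
arr1 = np.array([[1,2,3], [1,2,5], [1,7,8], [5,6,7]])
arr2 = np.array([[1,2,3,7,66,4,3,44,8,9],
                 [1,3,9,6,7,8,3,4,5,2],
                 [1,5,8,68,7,8,13,4,53,2],
                 [5,6,7,6,67,8,63,4,5,20]])
a, b = view1D(arr1, arr2[:,:3])
out = arr2[np.in1d(b, a)]
# out contains the rows starting with [1,2,3,...] and [5,6,7,...], as expected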
You can do it with the help of a set:
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2 = np.array([[7,8,9,11,14,34],[23,12,11,10,12,13],[1,2,3,4,5,6]])
# create array from arr2 with only first 3 columns
temp = [i[:3] for i in arr2]
aset = set([tuple(x) for x in arr])
bset = set([tuple(x) for x in temp])
np.array([x for x in aset & bset])
Output
array([[7, 8, 9],
[1, 2, 3]])
Edit
Use a list comprehension:
l = [list(i) for i in arr2 if tuple(i[:3]) in aset]
print(l)
Output:
[[7, 8, 9, 11, 14, 34], [1, 2, 3, 4, 5, 6]]
For integers Divakar already gave an excellent answer. If you want to compare floats you have to consider e.g. the following:
>>> 1. + 1e-15 == 1.
False
>>> 1. + 1e-16 == 1.
True
If this behaviour could lead to problems in your code, I would recommend performing a nearest-neighbour search and then checking whether the distances are within a specified threshold.
import numpy as np
from scipy import spatial

def get_indices_of_nearest_neighbours(arr1, arr2):
    tree = spatial.cKDTree(arr2[:, 0:3])
    # You can check here if the distance is small enough and otherwise raise an error
    dist, ind = tree.query(arr1, k=1)
    return ind
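For example, here is a sketch of how the returned indices might be combined with an explicit distance check (the 1e-9 tolerance is just an assumed placeholder):
ind = get_indices_of_nearest_neighbours(arr1, arr2)
nearest_rows = arr2[ind]  # row of arr2 nearest to each row of arr1
dists = np.linalg.norm(arr2[ind, 0:3] - arr1, axis=1)
matches = nearest_rows[dists < 1e-9]  # keep only (numerically) exact coordinate matches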
It seems I am stuck on the following problem with numpy.
I have an array X with shape: X.shape = (nexp, ntime, ndim, npart)
I need to compute binned statistics on this array along npart dimension, according to the values in binvals (and some bins), but keeping all the other dimensions there, because I have to use the binned statistic to remove some bias in the original array X. Binning values have shape binvals.shape = (nexp, ntime, npart).
A complete, minimal example to explain what I am trying to do. Note that, in reality, I am working on large arrays with several hundred bins (so this implementation takes forever):
import numpy as np
np.random.seed(12345)
X = np.random.randn(24).reshape(1,2,3,4)
binvals = np.random.randn(8).reshape(1,2,4)
bins = [-np.inf, 0, np.inf]
nexp, ntime, ndim, npart = X.shape
cleanX = np.zeros_like(X)
for ne in range(nexp):
    for nt in range(ntime):
        indices = np.digitize(binvals[ne, nt, :], bins)
        for nd in range(ndim):
            for nb in range(1, len(bins)):
                inds = indices == nb
                cleanX[ne, nt, nd, inds] = X[ne, nt, nd, inds] - \
                    np.mean(X[ne, nt, nd, inds], axis=-1)
Looking at the results of this may make it clearer?
In [8]: X
Out[8]:
array([[[[-0.20470766, 0.47894334, -0.51943872, -0.5557303 ],
[ 1.96578057, 1.39340583, 0.09290788, 0.28174615],
[ 0.76902257, 1.24643474, 1.00718936, -1.29622111]],
[[ 0.27499163, 0.22891288, 1.35291684, 0.88642934],
[-2.00163731, -0.37184254, 1.66902531, -0.43856974],
[-0.53974145, 0.47698501, 3.24894392, -1.02122752]]]])
In [10]: cleanX
Out[10]:
array([[[[ 0. , 0.67768523, -0.32069682, -0.35698841],
[ 0. , 0.80405255, -0.49644541, -0.30760713],
[ 0. , 0.92730041, 0.68805503, -1.61535544]],
[[ 0.02303938, -0.02303938, 0.23324375, -0.23324375],
[-0.81489739, 0.81489739, 1.05379752, -1.05379752],
[-0.50836323, 0.50836323, 2.13508572, -2.13508572]]]])
In [12]: binvals
Out[12]:
array([[[ -5.77087303e-01, 1.24121276e-01, 3.02613562e-01,
5.23772068e-01],
[ 9.40277775e-04, 1.34380979e+00, -7.13543985e-01,
-8.31153539e-01]]])
Is there a vectorized solution? I thought of using scipy.stats.binned_statistic, but I seem to be unable to understand how to use it for this aim. Thanks!
import numpy as np
np.random.seed(100)
nexp = 3
ntime = 4
ndim = 5
npart = 100
nbins = 4
binvals = np.random.rand(nexp, ntime, npart)
X = np.random.rand(nexp, ntime, ndim, npart)
bins = np.linspace(0, 1, nbins + 1)
# Bin index of each particle, with a singleton axis inserted so it broadcasts against ndim
d = np.digitize(binvals, bins)[:, :, np.newaxis, :]
# One boolean membership mask per bin, stacked along a new leading axis
r = np.arange(1, len(bins)).reshape((-1, 1, 1, 1, 1))
m = d[np.newaxis, ...] == r
# Per-bin means over the particle axis (clip avoids division by zero for empty bins)
counts = np.sum(m, axis=-1, keepdims=True).clip(min=1)
means = np.sum(X[np.newaxis, ...] * m, axis=-1, keepdims=True) / counts
# Subtract from each particle the mean of its own bin
cleanX = X - np.choose(d - 1, means)
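For a quick sanity check, the vectorized result can be compared against the nested-loop version from the question, re-run on the same data (a short sketch):
cleanX_loop = np.zeros_like(X)
for ne in range(nexp):
    for nt in range(ntime):
        indices = np.digitize(binvals[ne, nt, :], bins)
        for nd in range(ndim):
            for nb in range(1, len(bins)):
                inds = indices == nb
                cleanX_loop[ne, nt, nd, inds] = X[ne, nt, nd, inds] - np.mean(X[ne, nt, nd, inds])
assert np.allclose(cleanX, cleanX_loop)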
Ok, I think I got it, mainly based on the answer by @jdehesa.
clean2 = np.zeros_like(X)
d = np.digitize(binvals, bins)
for i in range(1, len(bins)):
    m = d == i
    minds = np.where(m)
    sl = (*minds[:2], slice(None), minds[2])
    msum = m.sum(axis=-1)
    clean2[sl] = (X - (np.sum(X * m[..., np.newaxis, :], axis=-1) /
                       msum[..., np.newaxis])[..., np.newaxis])[sl]
Which gives the same results as my original code.
On the small arrays I have in the example here, this solution is approximately three times as fast as the original code. I expect it to be way faster on larger arrays.
Update:
Indeed it's faster on larger arrays (I didn't do any formal test), but even so it only just reaches an acceptable level of performance... any further suggestions for extra vectorization would be very welcome.
Given two lists A={a1,a2,a3,...an} and B={b1,b2,b3,...bn}, I would say A>=B if and only if all ai>=bi.
There is a built-in logical comparison of two lists, A==B, but no A>B.
Do we need to compare each element, like this?
And @@ Table[A[[i]] >= B[[i]], {i, n}]
Any better tricks to do this?
EDIT:
Great thanks for all of you.
Here's a further question:
How do we find the maximum list (if it exists) among N lists?
Is there an efficient, easy way to find the maximum list among N lists of the same length using Mathematica?
Method 1: I prefer this method.
NonNegative[Min[a - b]]
Method 2: This is just for fun. As Leonid noted, it is given a bit of an unfair advantage for the data I used.
If one makes pairwise comparisons, returning False and breaking when appropriate, then a loop may be more efficient (although I generally shun loops in mma):
result = True;
n = 1; While[n < 1001, If[a[[n]] < b[[n]], result = False; Break[]]; n++]; result
Some timing comparisons on lists of 10^6 numbers:
a = Table[RandomInteger[100], {10^6}];
b = Table[RandomInteger[100], {10^6}];
(* OP's method *)
And @@ Table[a[[i]] >= b[[i]], {i, 10^6}] // Timing
(* acl's uncompiled method *)
And @@ Thread[a >= b] // Timing
(* Leonid's method *)
lessEqual[a, b] // Timing
(* David's method #1 *)
NonNegative[Min[a - b]] // Timing
Edit: I removed the timings for my Method #2, as they can be misleading. And Method #1 is more suitable as a general approach.
For instance,
And @@ Thread[A >= B]
should do the job.
EDIT: On the other hand, this
cmp = Compile[
{
{a, _Integer, 1},
{b, _Integer, 1}
},
Module[
{flag = True},
Do[
If[Not[a[[p]] >= b[[p]]], flag = False; Break[]],
{p, 1, Length@a}];
flag],
CompilationTarget -> "C"
]
is 20 times faster. 20 times uglier, too, though.
EDIT 2: Since David does not have a C compiler available, here are all the timing results, with two differences. Firstly, his second method has been fixed to compare all elements. Secondly, I compare a to itself, which is the worst case (otherwise, my second method above will only have to compare elements up to the first to violate the condition).
(*OP's method*)
And @@ Table[a[[i]] >= b[[i]], {i, 10^6}] // Timing
(*acl's uncompiled method*)
And @@ Thread[a >= b] // Timing
(*Leonid's method*)
lessEqual[a, b] // Timing
(*David's method #1*)
NonNegative[Min[a - b]] // Timing
(*David's method #2*)
Timing[result = True;
n = 1; While[n < Length[a],
If[a[[n]] < b[[n]], result = False; Break[]];
n++]; result]
(*acl's compiled method*)
cmp[a, a] // Timing
So the compiled method is much faster (note that David's second method and the compiled method here are the same algorithm, and the only difference is overhead).
All these are on battery power so there may be some random fluctuations, but I think they are representative.
EDIT 3: If, as ruebenko suggested in a comment, I replace Part with Compile`GetElement, like this
cmp2 = Compile[{{a, _Integer, 1}, {b, _Integer, 1}},
Module[{flag = True},
Do[If[Not[Compile`GetElement[a, p] >= Compile`GetElement[b, p]],
flag = False; Break[]], {p, 1, Length@a}];
flag], CompilationTarget -> "C"]
then cmp2 is about twice as fast as cmp.
Since you mentioned efficiency as a factor in your question, you may find these functions useful:
ClearAll[lessEqual, greaterEqual];
lessEqual[lst1_, lst2_] :=
SparseArray[1 - UnitStep[lst2 - lst1]]["NonzeroPositions"] === {};
greaterEqual[lst1_, lst2_] :=
SparseArray[1 - UnitStep[lst1 - lst2]]["NonzeroPositions"] === {};
These functions will be reasonably efficient. The solution of @David is still two to four times faster, and if you want extreme speed and your lists are numerical (made of Integer or Real numbers), you should probably use compilation to C (the solution of @acl, and similarly for other operators).
You can use the same techniques (using Unitize instead of UnitStep to implement equal and unequal) to implement other comparison operators (>, <, ==, !=). Keep in mind that UnitStep[0] == 1.
Comparison functions like Greater, GreaterEqual, Equal, Less, LessEqual can be made to apply to lists in a number of ways (they are all variations of the approach in your question).
With two lists:
a={a1,a2,a3};
b={b1,b2,b3};
and two instances with numeric entries
na={2,3,4}; nb={1,3,2};
you can use
And @@ NonNegative[na - nb]
With lists with symbolic entries,
And @@ NonNegative[a - b]
gives
NonNegative[a1 - b1] && NonNegative[a2 - b2] && NonNegative[a3 - b3]
For general comparisons, one can create a general comparison function like
listCompare[comp : (Greater | GreaterEqual | Equal | Less | LessEqual),
  list1_List, list2_List] := And @@ MapThread[comp, {list1, list2}]
Using as
listCompare[GreaterEqual,na,nb]
gives True. With symbolic entries
listCompare[GreaterEqual,a,b]
gives the logically equivalent expression b1 <= a1 && b2 <= a2 && b3 <= a3.
When working with packed arrays and numeric comparator such as >= it would be hard to beat David's Method #1.
However, for more complicated tests that cannot be converted to simple arithmetic another method is required.
A good general method, especially for unpacked lists, is to use Inner:
Inner[test, a, b, And]
This does not make all of the comparisons ahead of time and can therefore be much more efficient in some cases than e.g. And @@ MapThread[test, {a, b}]. This illustrates the difference:
test = (Print[#, " >= ", #2]; # >= #2) &;
{a, b} = {{1, 2, 3, 4, 5}, {1, 3, 3, 4, 5}};
Inner[test, a, b, And]
1 >= 1
2 >= 3
False
And @@ MapThread[test, {a, b}]
1 >= 1
2 >= 3
3 >= 3
4 >= 4
5 >= 5
False
If the arrays are packed, and especially if the likelihood that the return is False is high, then a loop such as David's Method #2 is a good option. It may be better written:
Null === Do[If[a[[i]] ~test~ b[[i]], , Return@False], {i, Length@a}]
1 >= 1
2 >= 3
False