Numpy Sum Rows of 2D Array uniquely (no sequence duplicates)

I have the following array
import numpy as np
single_array =
[[ 1 80 80 80]
 [ 2 80 80 89]
 [ 3 52 50 90]
 [ 4 39 34 54]
 [ 5 37 47 32]
 [ 6 42 42 27]
 [ 7 42 52 27]
 [ 8 38 33 28]
 [ 9 42 37 42]]
and want to create another array containing all unique sums of two rows within single_array, so that 1+2 and 2+1 are treated as duplicates and included only once.
First I would like to multiply each value in the 0th column by 10 (so I can identify which rows were combined), then add every pair of rows and append the sums to the new array.
Output should look like this:
double_array=
[[12 160 160 169]
 [13 132 130 170]
 [14 119 114 134]
 ...
 [98 80 70 70]]
Can I use itertools.combinations to get a 3D array with two unique combinations and then add the rows on the corresponding 3rd axis?

This
import numpy as np
from itertools import combinations
single_array = np.array(
    [[ 1, 80, 80, 80],
     [ 2, 80, 80, 89],
     [ 3, 52, 50, 90],
     [ 4, 39, 34, 54],
     [ 5, 37, 47, 32],
     [ 6, 42, 42, 27],
     [ 7, 42, 52, 27],
     [ 8, 38, 33, 28],
     [ 9, 42, 37, 42]]
)
np.vstack([single_array[i] * np.array([10, 1, 1, 1]) + single_array[j]
           for i, j in combinations(range(single_array.shape[0]), 2)])
does what you ask for in terms of specified input and output; I'm not sure if it's what you actually need. I don't think it will scale to big inputs.
A 3D array to find this sum would be ragged (first "layer" would be 9 deep, next one 8, etc.); you could maybe get around this with NaNs or masking. It also wouldn't scale that well for big inputs: you'd be allocating twice as much memory as you need, and then have to index out ragged layers to get your final output.
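If you'd rather stay vectorized without building a ragged 3D structure, one hedged alternative (my suggestion, not from the question) is to generate the same unique index pairs with np.triu_indices and use fancy indexing:
import numpy as np
# unique (i, j) pairs with i < j, in the same order as itertools.combinations
i, j = np.triu_indices(single_array.shape[0], k=1)
double_array = single_array[i] * np.array([10, 1, 1, 1]) + single_array[j]
This still allocates two index arrays, but it avoids the Python-level loop.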
If you have to do this fast for big arrays, I suggest a pre-allocated output array and a for-loop with Numba:
from numba import jit
import numpy as np

@jit(nopython=True)
def unique_row_sums(a):
    n = a.shape[0]
    # one output row per unique (i, j) pair with i < j
    b = np.empty((n*(n-1)//2, a.shape[1]))
    s = np.array([10, 1, 1, 1])  # scales the ID column so pairs stay identifiable
    k = 0
    for i in range(n):
        for j in range(i+1, n):
            b[k] = s * a[i] + a[j]
            k += 1
    return b
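A quick usage sketch (my addition; note the result comes back as float64, since np.empty defaults to that dtype):
double_array = unique_row_sums(single_array)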
In my not-too-careful testing with IPython's %timeit, this took about 4µs versus 152µs for the itertools-based version with your data, and should scale better.

Related

Numpy: How to slice or split 2D subsections of 2D array

I have a 2D array. For example:
ary = np.arange(24).reshape(6,4)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]]
I want to break this into smaller 2D arrays, each 2x2, and compute the square root of the sum of each. I actually want to use arbitrarily sized sub-arrays and compute arbitrary functions of them, but the question is easier to ask with concrete operations and concrete array sizes. So in this example, starting with a 6x4 array and computing the square root of the sums of 2x2 sub-arrays, the final result would be a 3x2 array, as follows:
[[3.16, 4.24],  # math.sqrt(0+1+4+5)    , math.sqrt(2+3+6+7)
 [6.48, 7.07],  # math.sqrt(8+9+12+13)  , math.sqrt(10+11+14+15)
 [8.60, 9.05]]  # math.sqrt(16+17+20+21), math.sqrt(18+19+22+23)
How can I slice, or split, or do some operation to perform some computation on 2D sub-arrays?
Here is a working, inefficient example of what I'm trying to do:
import numpy as np
a_height = 6
a_width = 4
a_area = a_height * a_width
a = np.arange(a_area).reshape(a_height, a_width)
window_height = 2
window_width = 2
b_height = a_height // window_height
b_width = a_width // window_width
b_area = b_height * b_width
b = np.zeros(b_area).reshape(b_height, b_width)
for i in range(b_height):
    for j in range(b_width):
        b[i, j] = a[i * window_height:(i + 1) * window_height,
                    j * window_width:(j + 1) * window_width].sum()
b = np.sqrt(b)
print(b)
# [[3.16227766 4.24264069]
# [6.4807407 7.07106781]
# [8.60232527 9.05538514]]
In [2]: ary = np.arange(24).reshape(6,4)
In [3]: ary
Out[3]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])
While I recommended moving-windows based on as_strided, we can also divide the array into 'blocks' with reshape and transpose:
In [4]: ary.reshape(3,2,2,2).transpose(0,2,1,3)
Out[4]:
array([[[[ 0,  1],
         [ 4,  5]],
        [[ 2,  3],
         [ 6,  7]]],
       [[[ 8,  9],
         [12, 13]],
        [[10, 11],
         [14, 15]]],
       [[[16, 17],
         [20, 21]],
        [[18, 19],
         [22, 23]]]])
In [5]: np.sqrt(_.sum(axis=(2,3)))
Out[5]:
array([[3.16227766, 4.24264069],
       [6.4807407 , 7.07106781],
       [8.60232527, 9.05538514]])
While the transpose makes it easier to visualize the blocks that need to be summed, it isn't necessary:
In [7]: np.sqrt(ary.reshape(3,2,2,2).sum(axis=(1,3)))
Out[7]:
array([[3.16227766, 4.24264069],
       [6.4807407 , 7.07106781],
       [8.60232527, 9.05538514]])
np.lib.stride_tricks.sliding_window_view doesn't give us as much direct control as I thought, but
np.lib.stride_tricks.sliding_window_view(ary,(2,2))[::2,::2]
gives the same result as Out[4].
In [13]: np.sqrt(np.lib.stride_tricks.sliding_window_view(ary,(2,2))[::2,::2].sum(axis=(2,3)))
Out[13]:
array([[3.16227766, 4.24264069],
       [6.4807407 , 7.07106781],
       [8.60232527, 9.05538514]])
The reshape-only version in [7] is faster.
In general, it can be done like this:
a_height = 15
a_width = 16
a_area = a_height * a_width
a = np.arange(a_area).reshape(a_height, a_width)
window_height = 3 # must evenly divide a_height
window_width = 4 # must evenly divide a_width
b_height = a_height // window_height
b_width = a_width // window_width
b = a.reshape(b_height, window_height, b_width, window_width).transpose(0,2,1,3)
# or, assuming you want sum or another function that takes `axis` argument
b = a.reshape(b_height, window_height, b_width, window_width).sum(axis=(1,3))
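If the function you want to apply doesn't take an axis argument, one hedged option (my addition, not part of the original answer) is to flatten each block and use np.apply_along_axis; my_func below is a placeholder for your arbitrary function:
blocks = a.reshape(b_height, window_height, b_width, window_width).transpose(0, 2, 1, 3)
# each (b_height, b_width) slot now holds a flattened window of size window_height*window_width
b = np.apply_along_axis(my_func, -1, blocks.reshape(b_height, b_width, -1))
Note that np.apply_along_axis loops in Python under the hood, so prefer axis-aware reductions when they exist.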

Update all values in a column in DASK array

I have a large dask array containing approximately 300 million records and 3 numeric columns.
It looks roughly like this (first few records):
2345 947 23
12 234 924
9 8 0
349 276 345
etc...
I would like to add, say, 100 to all the values in column 2, so that I get the dask array below. Any ideas?
2345 1047 23
12 334 924
9 108 0
349 376 345
etc...
The easiest way might just be to switch it over to a DataFrame, do the assignment there, and then switch back to an array:
df = darr.to_dask_dataframe(columns=["a", "b", "c"])
df["b"] += 100
darr = df.to_dask_array()
darr.compute()
This also has the benefit of being fairly obvious as to what is happening.
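One caveat worth adding (my note, not in the original answer): if the chunk sizes of the resulting array need to be known afterwards (e.g. for slicing or reshaping), to_dask_array accepts lengths=True to compute them:
darr = df.to_dask_array(lengths=True)  # computes chunk lengths so the array has known chunks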
I also took a shot at this using a generalized ufunc. I couldn't get da.apply_gufunc to work in combination with np.add.at, and I'm still working to grok ufuncs myself, so there is likely a faster or more compact way, but this appears to work:
import numpy as np
import dask.array as da
darr = da.array([
    [2345, 947, 23],
    [12, 234, 924],
    [9, 8, 0],
    [349, 276, 345]])

def add_at(arr, at, val):
    np.add.at(arr, at, val)  # in-place add of val at position `at` along each row
    return arr

gufunc_add_at = da.gufunc(add_at,
                          signature="(i),(),()->(i)",
                          output_dtypes=darr.dtype,
                          vectorize=True)
gufunc_add_at(darr, 1, 100).compute()
This is a bit clunky, but it seems to work:
import dask.array as da

darr = da.array([
    [2345, 947, 23],
    [12, 234, 924],
    [9, 8, 0],
    [349, 276, 345]])
print(darr.compute())
x = darr[:, 0].reshape(4, 1).compute()
y = (darr[:, 1] + 100).reshape(4, 1).compute()  # add 100 to the middle column
z = darr[:, 2].reshape(4, 1).compute()
t = da.stack([x, y, z], axis=1).reshape(4, 3)
t.compute()
Output:
[[2345  947   23]
 [  12  234  924]
 [   9    8    0]
 [ 349  276  345]]
array([[2345, 1047,   23],
       [  12,  334,  924],
       [   9,  108,    0],
       [ 349,  376,  345]])
This is possibly an improvement on my first answer:
import dask.array as da
from dask.array import from_array, add
from numpy import array

darr = da.array([
    [2345, 947, 23],
    [12, 234, 924],
    [9, 8, 0],
    [349, 276, 345]])
vector = from_array(array([[0], [100], [0]]))  # per-column offsets, as a column vector against darr.T
add(darr.T, vector).T.compute()
Output:
array([[2345, 1047,   23],
       [  12,  334,  924],
       [   9,  108,    0],
       [ 349,  376,  345]])
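For what it's worth (my addition), plain broadcasting sidesteps the transposes entirely; a minimal sketch, assuming the same darr as above:
import numpy as np
(darr + np.array([0, 100, 0])).compute()  # the row vector broadcasts across all rows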

Drop element in numpy array (or pandas series) if difference to previous element is <N

I have a numpy array that looks like this:
a = np.array([0,10,19,20,30,40,42,49,50,51])
I would like to drop all the elements whose difference from the previous element is <= 2, ending up with
a_filtered = np.array([0,10,19,30,40,49])
How can I do this in numpy? Bonus points for how to do this on a pandas series (e.g. drop all rows whose index difference is < N).
IIUC
s=pd.Series(a)
s[~(s.diff()<=2)]
Out[289]:
0     0
1    10
2    19
4    30
5    40
7    49
dtype: int32
s[~(s.diff()<=2)].to_numpy()
Out[292]: array([ 0, 10, 19, 30, 40, 49])
Here you go:
N = 2
s = pd.Series(a)
mask = ~s.diff().le(N)
s[mask]
# you can also do
# a[mask]
Output:
0     0
1    10
2    19
4    30
5    40
7    49
dtype: int32
In numpy, you can use np.diff, together with np.insert to special-case element 0:
m = np.insert(np.diff(a, 1) > 2, 0, True)
a[m]
Out[526]: array([ 0, 10, 19, 30, 40, 49])
Or use np.roll and set element 0 of the mask to True:
m = (a - np.roll(a, 1)) > 2
m[0] = True
a[m]
Out[534]: array([ 0, 10, 19, 30, 40, 49])
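For the index-difference variant the question also asks about, the same insert-a-leading-True trick works on the index values; a sketch assuming a sorted numeric index (the series and N here are made up for illustration):
import numpy as np
import pandas as pd
s = pd.Series([5, 6, 7, 8], index=[0, 1, 5, 6])
N = 3
keep = np.insert(np.diff(s.index.to_numpy()) >= N, 0, True)  # always keep the first row
s[keep]
# 0    5
# 5    7
# dtype: int64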

How to fetch the entire rows having even numbers in numpy?

I want to fetch the rows that contain even numbers from the array below:
mat1 = np.array([[23,45,63],[22,78,43],[12,77,47],[53,47,33]]).reshape(4,3)
mat1
array([[23, 45, 63],
       [22, 78, 43],
       [12, 77, 47],
       [53, 47, 33]])
And the code below returns only the values:
mat1[mat1%2==0]
array([22, 78, 12])
Is there any way to fetch the entire rows/columns containing the even numbers?
You can do that like this:
import numpy as np
mat1 = np.array([[23,45,63],[22,78,43],[12,77,47],[53,47,33]])
is_even = (mat1 % 2 == 0)
# Rows with at least one even number
print(mat1[is_even.any(1)])
# [[22 78 43]
#  [12 77 47]]
# Columns with at least one even number
print(mat1[:, is_even.any(0)])
# [[23 45]
#  [22 78]
#  [12 77]
#  [53 47]]
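If you instead want rows where every entry is even, swap any for all (my addition; with this particular mat1 no row qualifies, so the result is empty):
print(mat1[is_even.all(1)])
# []  -- an empty (0, 3) array, since no row of mat1 is all-even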

Extracting Columns iteratively from 2 different m-by-n matrices and concatenating in a set of n different m-by-2 matrices

In MATLAB, I have 2 m-by-n matrices, A and B. I want to make a set of n m-by-2 matrices such that in the ith matrix (of the set of n), the first column is the ith column from A and the second column is the ith column from B.
How do I extract and concatenate the ith columns from both matrices?
How can I store these n matrices? Using loops? (Memory?)
Example:
Input:
A = [ 1, 2, 3; 4, 5 ,6; 7, 8, 9] (3x3 matrix)
B = [ 11, 22, 33; 44, 55 ,66; 77, 88, 99] (3x3 matrix)
Output:
For i=1:3
C1 = [1, 11; 4, 44; 7, 77]
C2 = [2, 22; 5, 55; 8, 88]
C3 = [3, 33; 6, 66; 9, 99]
The first thing I'm going to do is change your variable names. Mainly this is just to make referring to the variables easier, especially as m and n change. Instead of writing
C1(:,:)
C2(:,:)
...
Cn(:,:)
I'm going to write
C(:,:,1)
C(:,:,2)
...
C(:,:,n)
All I've done is moved the index from the variable name to the index of the 3rd dimension.
Now, to create the C array:
A = [ 1, 2, 3; 4, 5 ,6; 7, 8, 9]
B = [ 11, 22, 33; 44, 55 ,66; 77, 88, 99]
[m,n]=size(A)
C = reshape([A',B']', m, 2, n)
The output of this is:
A =
    1    2    3
    4    5    6
    7    8    9
B =
   11   22   33
   44   55   66
   77   88   99
m = 3
n = 3
C =
ans(:,:,1) =
    1   11
    4   44
    7   77
ans(:,:,2) =
    2   22
    5   55
    8   88
ans(:,:,3) =
    3   33
    6   66
    9   99
As you can see, C(:,:,1) is equal to C1 in your example, C(:,:,2) = C2 and so on. And this extends without change as the sizes of A and B change. You never have to come up with new variable names. And all you have to do to know how many m-by-2 matrices you've got is
numVars = size(C,3);
Note: This uses the same technique found in the answer here: matlab - how to merge/interlace 2 matrices?
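For readers following along in NumPy (my addition, not part of the original MATLAB answer), np.stack builds the same layout, with C[k] holding the k-th m-by-2 matrix:
import numpy as np
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]])
C = np.stack((A, B), axis=-1).transpose(1, 0, 2)  # shape (n, m, 2)
print(C[0])  # [[ 1 11] [ 4 44] [ 7 77]], matching C1 above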
