Find and delete all-zero columns from a NumPy array using fancy indexing

How do I find columns in a numpy array that are all-zero and then delete them from the array? I'm looking for a way to both get the column indices and then use those indices to delete.

You could use np.argwhere with np.all to find the indices. To delete those columns, use np.delete.
Example:
Find the all-zero columns:
import numpy as np
a = np.array([[1, 2, 0, 3, 0],
              [4, 5, 0, 6, 0],
              [7, 8, 0, 9, 0]])
idx = np.argwhere(np.all(a == 0, axis=0))
>>> idx
array([[2],
[4]])
Delete those columns:
a2 = np.delete(a, idx, axis=1)
>>> a2
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

Here is a solution I found.
Say OriginMat is the matrix holding the original data and Result is the matrix that should hold the output. Then:
Result = OriginMat[:, ~np.all(OriginMat == 0, axis=0)]
Breaking it down:
The expression below checks, column by column (axis=0), whether every value is 0,
and then negates the result so that all-zero columns are marked False:
~np.all(OriginMat == 0, axis=0)
The result is a boolean vector that is False for each column whose elements are all 0
and True for every other column.
The last step selects the columns marked True (that is, the columns that are not all zero).
I got this solution thanks to the website below:
https://www.science-emergence.com/Articles/How-to-remove-array-rows-that-contain-only-0-in-python/
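As a minimal sketch of the breakdown above, the intermediate values can be printed for a small array. The data is an assumed example (the same array used in the first answer); the names all_zero, keep, and zero_cols are introduced here only for illustration.

```python
import numpy as np

# Assumed example data: columns 2 and 4 contain only zeros.
OriginMat = np.array([[1, 2, 0, 3, 0],
                      [4, 5, 0, 6, 0],
                      [7, 8, 0, 9, 0]])

all_zero = np.all(OriginMat == 0, axis=0)  # True for each all-zero column
keep = ~all_zero                           # True for each column to keep
zero_cols = np.flatnonzero(all_zero)       # indices of the all-zero columns
Result = OriginMat[:, keep]                # boolean indexing along axis 1

print(all_zero)   # [False False  True False  True]
print(zero_cols)  # [2 4]
print(Result)
```

Keeping the boolean mask around gives both things the question asks for: the column indices (via np.flatnonzero) and the filtered array.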

# Some random array of 1's and 0's
x = np.random.randint(0,2, size=(3, 100))
# Find where all values in the columns are zero
mask = (x == 0).all(0)
# Find the indices of these columns
column_indices = np.where(mask)[0]
# Update x to only include the columns where non-zero values occur.
x = x[:,~mask]

The following works, simplifying @sacuL's answer:
>>> a = np.array([[1, 2, 0, 3, 0],
...               [4, 5, 0, 6, 0],
...               [7, 8, 0, 9, 0]])
>>> a = a[:, np.any(a, axis=0)]
>>> a
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

Related

Numpy argsort while distinguishing values of 0

I have a very large array but here I will show a simplified case:
a = np.array([[3, 0, 5, 0], [8, 7, 6, 10], [5, 4, 0, 10]])
array([[ 3, 0, 5, 0],
[ 8, 7, 6, 10],
[ 5, 4, 0, 10]])
I want to argsort() the array but have a way to distinguish 0s. I tried to replace it with NaN:
a = np.array([[3, np.nan, 5, np.nan], [8, 7, 6, 10], [5, 4, np.nan, 10]])
a.argsort()
array([[0, 2, 1, 3],
[2, 1, 0, 3],
[1, 0, 3, 2]])
But the NaNs are still being sorted. Is there any way to make argsort give them an index of -1 or something similar? Or is there another option besides NaN for replacing the 0s? I tried math.inf as well, with no success. Does anybody have any ideas?
The purpose of doing this is that I have a cosine similarity matrix, and I want to exclude those instances where similarities are 0. I am using argsort() to get the highest similarities, which will give me the indices to another table with mappings to labels. If an array's entire similarity is 0 ([0,0,0]), then I want to ignore it. So if I can get argsort() to output it as [-1,-1,-1] after sorting, I can check to see if the entire array is -1 and exclude it.
EDIT:
So output should be:
array([[0, 2, -1, -1],
[2, 1, 0, 3],
[1, 0, 3, -1]])
So, using the last row of the output to refer back to the last row of a: the smallest value is at index 1 (which is 4), followed by index 0 (which is 5), then index 3 (which is 10), and finally -1, which stands for the 0.
You may want to use numpy.ma.array(), like this:
a = np.array([[3,4,5],[8,7,6],[5,4,0]])
Mask this array with the condition a == 0:
a_mask = np.ma.array(a, mask=(a==0))
print(repr(a_mask))
# output
masked_array(
data=[[3, 4, 5],
[8, 7, 6],
[5, 4, --]],
mask=[[False, False, False],
[False, False, False],
[False, False, True]],
fill_value=999999)
print(repr(a_mask.mask))
# output
array([[False, False, False],
[False, False, False],
[False, False, True]])
You can then use the mask attribute of the masked array to identify the elements you want to label and fill in other values.
If by "distinguish 0s" you mean treating them as the highest or lowest values, I would suggest trying:
a[a==0]=(a.max()+1)
or:
a[a==0]=(a.min()-1)
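A minimal sketch of the sentinel idea above, applied to the array from the question. It works on a copy (an assumption made here so that the zeros in a stay available) and uses a stable sort so that tied sentinel values keep their original order.

```python
import numpy as np

a = np.array([[3, 0, 5, 0], [8, 7, 6, 10], [5, 4, 0, 10]])

b = a.copy()              # keep the original zeros in a untouched
b[b == 0] = b.max() + 1   # zeros now compare larger than every real value
order = np.argsort(b, axis=1, kind="stable")

print(order)
# [[0 2 1 3]
#  [2 1 0 3]
#  [1 0 3 2]]
```

The indices of the former zeros end up last in each row, but they are still ordinary indices; telling them apart requires a separate check such as a == 0.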
One way to achieve this is to first build a boolean mask of the zero values (since those are what you want to distinguish), then argsort the array, and finally use the mask to set the desired value (e.g., -1) in the result.
# your unmodified input array
In [294]: a
Out[294]:
array([[3, 4, 5],
[8, 7, 6],
[5, 4, 0]])
# boolean mask checking for zero
In [295]: zero_bool_mask = a == 0
In [296]: zero_bool_mask
Out[296]:
array([[False, False, False],
[False, False, False],
[False, False, True]])
# usual argsort
In [297]: sorted_idxs = np.argsort(a)
In [298]: sorted_idxs
Out[298]:
array([[0, 1, 2],
[2, 1, 0],
[2, 1, 0]])
# replace the indices of 0 with desired value (e.g., -1)
In [299]: sorted_idxs[zero_bool_mask] = -1
In [300]: sorted_idxs
Out[300]:
array([[ 0, 1, 2],
[ 2, 1, 0],
[ 2, 1, -1]])
Following this, to account for the correct sorting indices after the substitution value (e.g., -1), we have to perform this final step:
In [327]: sorted_idxs - (sorted_idxs == -1).sum(1)[:, None]
Out[327]:
array([[ 0, 1, 2],
[ 2, 1, 0],
[ 1, 0, -2]])
So now the sorted_idxs with negative values are the locations where you had zeros in the original array.
Thus, we can have a custom function like so:
def argsort_excluding_zeros(arr, replacement_value):
    zero_bool_mask = arr == 0
    sorted_idxs = np.argsort(arr)
    sorted_idxs[zero_bool_mask] = replacement_value
    return sorted_idxs - (sorted_idxs == replacement_value).sum(1)[:, None]
# another array
In [339]: a
Out[339]:
array([[0, 4, 5],
[8, 7, 6],
[5, 4, 0]])
# sample run
In [340]: argsort_excluding_zeros(a, replacement_value=-1)
Out[340]:
array([[-2, 0, 1],
[ 2, 1, 0],
[ 1, 0, -2]])
Using @kmario23's and @ScienceSnake's code, I came up with this solution:
a = np.array([[3, 0, 5, 0], [8, 7, 6, 10], [5, 4, 0, 10]])
b = np.where(a == 0, np.inf, a) # Replace 0 -> inf to make them sorted last
s = b.copy() # make a copy of b to sort it
s.sort()
mask = s == np.inf # create a mask to get inf locations after sorting
c = b.argsort()
d = np.where(mask, -1, c) # Replace where the zeros were originally with -1
Out:
array([[ 0, 2, -1, -1],
[ 2, 1, 0, 3],
[ 1, 0, 3, -1]])
This is not the most efficient solution, because it sorts twice.
There might be a slightly more efficient alternative, but this works in pure numpy and is very transparent.
import numpy as np
a = np.array([[3, 0, 5, 0], [8, 7, 6, 10], [5, 4, 0, 10]])
b = np.where(a == 0, np.inf, a) # Replace 0 -> inf to make them sorted last
c = b.argsort()
d = np.where(a == 0, -1, c) # Replace where the zeros were originally with -1
print(d)
outputs
[[ 0 -1 1 -1]
[ 2 1 0 3]
[ 1 0 -1 2]]
To save memory, some of the in-between assignments can be skipped, but I left it this way for clarity.
*** EDIT ***
The OP has clarified exactly what output they want. This is my new solution which has only one sort.
a = np.array([[3, 0, 5, 0], [8, 7, 6, 10], [5, 4, 0, 10]])
b = np.where(a == 0, np.inf, a).argsort()
def remove_invalid_entries(row, num_valid):
    row[num_valid.pop():] = -1
    return row
num_valid = np.flip(np.count_nonzero(a, 1)).tolist()
b = np.apply_along_axis(remove_invalid_entries, 1, b, num_valid)
print(b)
> [[ 0 2 -1 -1]
[ 2 1 0 3]
[ 1 0 3 -1]]
The start is as before. Then we go through the argsorted array row by row and replace the last n elements with -1, where n is the number of 0s in the corresponding row of the original array. One convenient way to do this is with np.apply_along_axis. Here, I count the nonzero entries in each row of a and turn the counts into a list (in reversed order), so that pop() yields the number of elements to keep in the row of b currently being processed by np.apply_along_axis.
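The row-by-row replacement can also be sketched without a Python-level callback, by comparing each column position with the per-row count of nonzero entries. This is an alternative written for illustration, not the code from the answer; the names order, num_valid, and cols are assumptions.

```python
import numpy as np

a = np.array([[3, 0, 5, 0], [8, 7, 6, 10], [5, 4, 0, 10]])

# Zeros become inf, so their indices land at the end of each argsorted row.
order = np.where(a == 0, np.inf, a).argsort(axis=1, kind="stable")

num_valid = np.count_nonzero(a, axis=1)  # nonzero entries per row
cols = np.arange(a.shape[1])             # column positions 0..n-1

# Keep the first num_valid indices of each row, replace the rest with -1.
result = np.where(cols < num_valid[:, None], order, -1)

print(result)
# [[ 0  2 -1 -1]
#  [ 2  1  0  3]
#  [ 1  0  3 -1]]
```

A row that is entirely zero has num_valid equal to 0, so it comes out as all -1, which matches the check described in the question.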

drop np array rows based on element uniqueness and one other condition

Consider the 2d integer array below:
import numpy as np
arr = np.array([[1, 3, 5, 2, 8],
[9, 6, 1, 7, 6],
[4, 4, 1, 8, 0],
[2, 3, 1, 8, 5],
[1, 2, 3, 4, 5],
[6, 6, 7, 9, 1],
[5, 3, 1, 8, 2]])
PROBLEM: Eliminate rows from arr that meet two conditions:
a) The row's elements MUST be unique
b) From these unique-element rows, I want to eliminate the permutation duplicates.
All other rows in arr are kept.
In the example given above, the rows with indices 0,3,4, and 6 meet condition a). Their elements are unique.
Of these 4 rows, the ones with indices 0,3,6 are permutations of each other: I want to keep
one of them, say index 0, and ELIMINATE the other two.
The output would look like:
[[1, 3, 5, 2, 8],
[9, 6, 1, 7, 6],
[4, 4, 1, 8, 0],
[1, 2, 3, 4, 5],
[6, 6, 7, 9, 1]]
I can identify the rows that meet condition a) with something like:
s = np.sort(arr,axis=1)
arr[~(s[:,:-1] == s[:,1:]).any(1)]
But, I'm not sure at all how to eliminate the permutation duplicates.
Here's one way -
# Sort along row
b = np.sort(arr,axis=1)
# Mask of rows with unique elements and select those rows
m = (b[:,:-1] != b[:,1:]).all(1)
d = b[m]
# Indices of uniq rows
idx = np.flatnonzero(m)
# Get indices of rows among them that are unique as per possible permutes
u,stidx,c = np.unique(d, axis=0, return_index=True, return_counts=True)
# Concatenate unique ones among these and non-masked ones
out = arr[np.sort(np.r_[idx[stidx], np.flatnonzero(~m)])]
Alternatively, the final step could be optimized further with something like this:
m[idx[stidx]] = 0
out = arr[~m]
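Put together as a self-contained sketch, the steps above might look like the following. The function name drop_permutation_duplicates is hypothetical, and the code keeps the first occurrence of each group of permutations.

```python
import numpy as np

def drop_permutation_duplicates(arr):
    """Drop rows that have unique elements and repeat an earlier row up to permutation."""
    b = np.sort(arr, axis=1)                      # sort each row
    m = (b[:, :-1] != b[:, 1:]).all(axis=1)       # rows whose elements are all distinct
    idx = np.flatnonzero(m)                       # original indices of those rows
    # First occurrence of each distinct sorted row among the unique-element rows.
    _, first = np.unique(b[m], axis=0, return_index=True)
    drop = m.copy()
    drop[idx[first]] = False                      # keep one representative per group
    return arr[~drop]

arr = np.array([[1, 3, 5, 2, 8],
                [9, 6, 1, 7, 6],
                [4, 4, 1, 8, 0],
                [2, 3, 1, 8, 5],
                [1, 2, 3, 4, 5],
                [6, 6, 7, 9, 1],
                [5, 3, 1, 8, 2]])

out = drop_permutation_duplicates(arr)
print(out)  # rows 0, 1, 2, 4 and 5 of arr
```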

Replace elements of numpy array based on first occurrence of a particular value

Suppose there's a numpy 2D array as follows:
>>> x = np.array([[4,2,3,1,1], [1,0,3,2,1], [1,4,4,3,4]])
>>> x
array([[4, 2, 3, 1, 1],
[1, 0, 3, 2, 1],
[1, 4, 4, 3, 4]])
My objective is to find the first occurrence of the value 4 in each row and set that element and every element after it in the row to 0. Hence, after this operation, the transformed array should look like:
>>> x_new
array([[0, 0, 0, 0, 0],
[1, 0, 3, 2, 1],
[1, 0, 0, 0, 0]])
What is the pythonic and optimized way of doing this? I tried with a combination of np.argmax() and np.take() but was unable to achieve the end objective.
You can do it using a cumulative sum across the columns (i.e. axis=1) and boolean indexing:
n = 4
idx = np.cumsum(x == n, axis=1) > 0
x[idx] = 0
Or, perhaps better, use a cumulative logical OR:
idx = np.logical_or.accumulate(x == n, axis=1)
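A runnable sketch of the cumulative-OR variant on the array from the question. It uses np.where to build a new array instead of modifying x in place, which is a choice made here for illustration.

```python
import numpy as np

x = np.array([[4, 2, 3, 1, 1],
              [1, 0, 3, 2, 1],
              [1, 4, 4, 3, 4]])
n = 4

# True from the first occurrence of n onward in each row.
from_first_hit = np.logical_or.accumulate(x == n, axis=1)

# Zero out the marked positions, leave the rest unchanged.
x_new = np.where(from_first_hit, 0, x)

print(x_new)
# [[0 0 0 0 0]
#  [1 0 3 2 1]
#  [1 0 0 0 0]]
```

Rows that do not contain n at all, such as the second row here, are left untouched because their mask is entirely False.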

Removing submatrix from numpy array by shifting other elements

Suppose I have a NumPy array
a = np.array([[1,2,3,4],
[3,4,5,6],
[2,3,4,4],
[3,3,1,2]])
I want to delete the submatrix [[3,4],[3,1]]. I can do it as follows
mask = np.ones(a.shape,dtype=bool)
mask[2:,1:-1] = False
a_new = a[mask,...]
print(a_new) # output: array([1, 2, 3, 4, 3, 4, 5, 6, 2, 4, 3, 2])
However, I want the output as
np.array([[1,2,3,4],
[3,4,5,6],
[2,4,0,0],
[3,2,0,0]])
I just want NumPy to remove the submatrix and shift the other elements, filling the vacated places with 0. How can I do this?
I cannot find a function that does what you ask, but combining np.roll with a mask, as in this routine, produces your output. Perhaps there is a more elegant way:
a = np.array([[1,2,3,4],
[3,4,5,6],
[2,3,4,4],
[3,3,1,2]])
mask = np.ones(a.shape,dtype=bool)
mask[2:,1:-1] = False
mask2 = mask.copy()
mask2[2:, 1:] = False
n = 2 #shift length
a[~mask2] = np.roll((a * mask)[~mask2],-n)
a
>>array([[1, 2, 3, 4],
[3, 4, 5, 6],
[2, 4, 0, 0],
[3, 2, 0, 0]])
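A different way to get the same result, sketched here as an assumption rather than as part of the answer above, is to sort each row by the removal mask with a stable sort and reorder the values with np.take_along_axis. This does not depend on a fixed shift length.

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [3, 4, 5, 6],
              [2, 3, 4, 4],
              [3, 3, 1, 2]])

keep = np.ones(a.shape, dtype=bool)
keep[2:, 1:-1] = False  # mark the submatrix [[3, 4], [3, 1]] for removal

# A stable sort of ~keep puts kept positions (False) first, in their original order.
order = np.argsort(~keep, axis=1, kind="stable")

# Zero out the removed entries, then reorder each row so they end up on the right.
a_new = np.take_along_axis(np.where(keep, a, 0), order, axis=1)

print(a_new)
# [[1 2 3 4]
#  [3 4 5 6]
#  [2 4 0 0]
#  [3 2 0 0]]
```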
You can simply set those entries to zero.
a = np.array([[1,2,3,4],
[3,4,5,6],
[2,3,4,4],
[3,3,1,2]])
a[2:, 2:] = 0
returns
array([[1, 2, 3, 4],
[3, 4, 5, 6],
[2, 3, 0, 0],
[3, 3, 0, 0]])

NumPy: indexing array by list of tuples - how to do it correctly?

I am in the following situation - I have the following:
Multidimensional numpy array a of n dimensions
t, an array of k rows (tuples), each with n elements. In other words, each row in this array is an index in a
What I want: from a, return an array b with k scalar elements, the ith element in b being the result of indexing a with the ith tuple from t.
Seems trivial enough. The following approach, however, does not work
def get(a, t):
    # wrong result + takes way too long
    return a[t]
I have to resort to doing this iteratively i.e. the following works correctly:
def get(a, t):
    res = []
    for ind in t:
        a_scalar = a
        for i in ind:
            a_scalar = a_scalar[i]
        # a_scalar is now a scalar
        res.append(a_scalar)
    return res
This works, except for the fact that given that each dimension in a has over 30 elements, the procedure does get really slow when n gets to more than 5. I understand that it would be slow regardless, however, I would like to exploit numpy's capabilities as I believe it would speed up this process considerably.
The key to getting this right is to understand the roles of lists and tuples in indexing. Often the two are treated the same, but in numpy indexing, tuples, lists, and arrays convey different information.
In [1]: a = np.arange(12).reshape(3,4)
In [2]: t = np.array([(0,0),(1,1),(2,2)])
In [4]: a
Out[4]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [5]: t
Out[5]:
array([[0, 0],
[1, 1],
[2, 2]])
You tried:
In [6]: a[t]
Out[6]:
array([[[ 0, 1, 2, 3],
[ 0, 1, 2, 3]],
[[ 4, 5, 6, 7],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[ 8, 9, 10, 11]]])
So what's wrong with it? It ran, but selected a (3,2) array of rows of a. That is, it applied t to just the first dimension, effectively a[t, :]. You want to index on all dimensions, some sort of a[t1, t2]. That's the same as a[(t1,t2)] - a tuple of indices.
In [10]: a[tuple(t[0])] # a[(0,0)]
Out[10]: 0
In [11]: a[tuple(t[1])] # a[(1,1)]
Out[11]: 5
In [12]: a[tuple(t[2])]
Out[12]: 10
or doing all at once:
In [13]: a[(t[:,0], t[:,1])]
Out[13]: array([ 0, 5, 10])
Another way to write it, is n lists (or arrays), one for each dimension:
In [14]: a[[0,1,2],[0,1,2]]
Out[14]: array([ 0, 5, 10])
In [18]: tuple(t.T)
Out[18]: (array([0, 1, 2]), array([0, 1, 2]))
In [19]: a[tuple(t.T)]
Out[19]: array([ 0, 5, 10])
More generally, in a[idx1, idx2] array idx1 is broadcast against idx2 to produce a full selection array. Here the 2 arrays are 1d and match, the selection is your t set of pairs. But the same principle applies to selecting a set of rows and columns, a[ [[0],[2]], [0,2,3] ].
Using the ideas in [10] and following, your get could be sped up with:
In [20]: def get(a, t):
    ...:     res = []
    ...:     for ind in t:
    ...:         res.append(a[tuple(ind)]) # index all dimensions at once
    ...:     return res
    ...:
In [21]: get(a,t)
Out[21]: [0, 5, 10]
If t really was a list of tuples (as opposed to an array built from them), your get could be:
In [23]: tl = [(0,0),(1,1),(2,2)]
In [24]: [a[ind] for ind in tl]
Out[24]: [0, 5, 10]
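Combining the tuple(t.T) idea from [19] with the get function gives a loop-free version. The sketch below assumes t is a sequence of k index tuples, each with one entry per dimension of a.

```python
import numpy as np

def get(a, t):
    # Transpose the (k, n) index table into n arrays of length k,
    # one per dimension, and index all dimensions at once.
    index = tuple(np.asarray(t).T)
    return a[index]

a = np.arange(12).reshape(3, 4)
t = [(0, 0), (1, 1), (2, 2)]

print(get(a, t))  # [ 0  5 10]
```

The result is a NumPy array of k elements rather than a Python list, and the same function accepts t either as a list of tuples or as a (k, n) integer array.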
Explore using np.ravel_multi_index
Create some test data
arr = np.arange(10**4)
arr.shape=10,10,10,10
t = []
for j in range(5):
    t.append( tuple(np.random.randint(10, size = 4)))
print(t)
# [(1, 8, 2, 0),
# (2, 3, 3, 6),
# (1, 4, 8, 5),
# (2, 2, 6, 3),
# (0, 5, 0, 2),]
ta = np.array(t).T
print(ta)
# array([[1, 2, 1, 2, 0],
# [8, 3, 4, 2, 5],
# [2, 3, 8, 6, 0],
# [0, 6, 5, 3, 2]])
arr.ravel()[np.ravel_multi_index(tuple(ta), (10,10,10,10))]
# array([1820, 2336, 1485, 2263, 502])
np.ravel_multi_index computes, from the tuple of index arrays, the corresponding indices into the flattened version of an array whose shape is (in this case) (10, 10, 10, 10).
Does this do what you need? Is it fast enough?
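As a quick check, the sketch below fixes t to the five tuples printed above (instead of drawing them at random) and confirms that the flat-index route and direct indexing with a tuple of arrays select the same elements.

```python
import numpy as np

arr = np.arange(10**4).reshape(10, 10, 10, 10)

# The index tuples shown in the sample output above, hard-coded for reproducibility.
t = [(1, 8, 2, 0), (2, 3, 3, 6), (1, 4, 8, 5), (2, 2, 6, 3), (0, 5, 0, 2)]
ta = np.array(t).T  # one row of indices per dimension

flat_idx = np.ravel_multi_index(tuple(ta), arr.shape)
via_flat = arr.ravel()[flat_idx]   # index the flattened array
via_tuple = arr[tuple(ta)]         # index all dimensions at once

print(via_flat)                             # [1820 2336 1485 2263  502]
print(np.array_equal(via_flat, via_tuple))  # True
```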
