Select rows by minimum values of a column considering unique values of another column (numpy array) - arrays

I want to select only the rows for each unique value of a column (first column) that have a minimum value in another column (second column).
How can I do it?
Let's say I have this array:
[[10, 1], [10, 5], [10, 2], [20, 4], [20, 1], [20, 7], [20, 2], [40, 7], [40, 4], [40, 5]]
I would like to obtain the following array:
[[10, 1], [20, 1], [40, 4]]
I was trying selecting rows in this way:
d = {i: array[array[:, 0] == i] for i in np.unique(array[:, 0])}
but then I don't know how to detect, within each group, the row with the minimum value in the second column.

What you want is the idea of groupby, as implemented in pandas for instance. As we don't have that in numpy, let's implement something similar to this other answer.
Let's call your input array A. So first, sort the rows by the values in the first column. We do this so that all entries with the same value appear one after the other.
sor = A[A[:,0].argsort()]
And get the indices where new unique values are found.
uniq=np.unique(sor[:,0],return_index=True)[1]
print(uniq)
>>> array([0, 3, 7])
This indicates the places of the array where we need to cut to get groups. Now split the second column into such groups. That way you get chunks of elements of the second column, grouped by the elements on the first column.
grp=np.split(sor[:,1],uniq[1:])
print(grp)
>>> [array([1, 5, 2]), array([4, 1, 7, 2]), array([7, 4, 5])]
Last step is to get the index of the minimum value out of each of these groups
ind=np.array(list(map(np.argmin,grp))) + uniq
print(ind)
>>> array([0, 4, 8])
The first part maps the np.argmin function over every group in grp, giving each minimum's position within its group. The + uniq part shifts each of those within-group positions back to an index into the full sorted array.
Now you only need to index your sorted array using these indices.
print(sor[ind])
>>> array([[10,  1],
           [20,  1],
           [40,  4]])
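Putting the pieces together, here is the whole procedure as one self-contained runnable sketch (same example data as above):

```python
import numpy as np

A = np.array([[10, 1], [10, 5], [10, 2], [20, 4], [20, 1],
              [20, 7], [20, 2], [40, 7], [40, 4], [40, 5]])

# Sort rows by the first column so equal keys sit next to each other.
sor = A[A[:, 0].argsort()]

# Start index of each group of equal first-column values.
uniq = np.unique(sor[:, 0], return_index=True)[1]

# Split the second column into one chunk per group.
grp = np.split(sor[:, 1], uniq[1:])

# Per-group argmin, shifted back to indices into the sorted array.
ind = np.array([np.argmin(g) for g in grp]) + uniq

print(sor[ind])  # rows [10, 1], [20, 1], [40, 4]
```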

Related

drop np array rows based on element uniqueness and one other condition

Consider the 2d integer array below:
import numpy as np
arr = np.array([[1, 3, 5, 2, 8],
                [9, 6, 1, 7, 6],
                [4, 4, 1, 8, 0],
                [2, 3, 1, 8, 5],
                [1, 2, 3, 4, 5],
                [6, 6, 7, 9, 1],
                [5, 3, 1, 8, 2]])
PROBLEM: Eliminate rows from arr that meet two conditions:
a) The row's elements MUST be unique
b) From these unique-element rows, I want to eliminate the permutation duplicates.
All other rows in arr are kept.
In the example given above, the rows with indices 0,3,4, and 6 meet condition a). Their elements are unique.
Of these 4 rows, the ones with indices 0,3,6 are permutations of each other: I want to keep
one of them, say index 0, and ELIMINATE the other two.
The output would look like:
[[1, 3, 5, 2, 8],
 [9, 6, 1, 7, 6],
 [4, 4, 1, 8, 0],
 [1, 2, 3, 4, 5],
 [6, 6, 7, 9, 1]]
I can identify the rows that meet condition a) with something like:
s = np.sort(arr,axis=1)
arr[~(s[:,:-1] == s[:,1:]).any(1)]
But, I'm not sure at all how to eliminate the permutation duplicates.
Here's one way -
# Sort along row
b = np.sort(arr,axis=1)
# Mask of rows with unique elements and select those rows
m = (b[:,:-1] != b[:,1:]).all(1)
d = b[m]
# Indices of uniq rows
idx = np.flatnonzero(m)
# Get indices of rows among them that are unique as per possible permutes
u,stidx,c = np.unique(d, axis=0, return_index=True, return_counts=True)
# Concatenate unique ones among these and non-masked ones
out = arr[np.sort(np.r_[idx[stidx], np.flatnonzero(~m)])]
Alternatively, final step could be optimized further, with something like this -
m[idx[stidx]] = 0
out = arr[~m]
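For completeness, here is the same approach as a runnable sketch on the example array (the `return_counts` output is dropped since the counts aren't used):

```python
import numpy as np

arr = np.array([[1, 3, 5, 2, 8],
                [9, 6, 1, 7, 6],
                [4, 4, 1, 8, 0],
                [2, 3, 1, 8, 5],
                [1, 2, 3, 4, 5],
                [6, 6, 7, 9, 1],
                [5, 3, 1, 8, 2]])

# Sort within each row so that permutations become identical rows.
b = np.sort(arr, axis=1)

# Mask of rows whose elements are all unique.
m = (b[:, :-1] != b[:, 1:]).all(1)
idx = np.flatnonzero(m)

# First occurrence of each permutation class among the unique-element rows.
u, stidx = np.unique(b[m], axis=0, return_index=True)

# Keep those first occurrences plus all rows that failed condition a).
out = arr[np.sort(np.r_[idx[stidx], np.flatnonzero(~m)])]
print(out)
```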

How can I create a function that combines list/array rows/columns/elements in arbitrary sized array/list?

Afternoon. I'm trying to create a function that, given an array or list and a specified selection of columns/rows/elements, removes the specified columns/rows/etc. and concatenates them into a new array/list, much in this fashion (but for arbitrarily sized objects that may or may not be pretty big):
a = [1 2 3     b = ['a','b','c',
     4 5 6          'd','e','f',
     7 8 9]         'g','h','i']
Now, let's say I want the first and third columns. Then this would look like
a' = [1 3     b' = ['a','c',
      4 6           'd','f',
      7 9]          'g','i']
I'm familiar with slicing and extracting indices using numpy, so where I'm really hung up is creating some object (a list or array of arrays/lists?) that contains the chosen columns/whatever (above I chose the first and third columns, as you can see), and then iterating over that object to create a concatenated/combined list of what I've specified (i.e., if I'm given an array with 127 variables, I want to extract an arbitrary number of arbitrary columns at a given time).
Thanks for taking a look. Let me know how to update the op if anything is unclear.
How is this different from advanced indexing?
In [324]: A = np.arange(12).reshape(2,6)
In [325]: A
Out[325]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])
In [326]: A[:,[1,2,4]]
Out[326]:
array([[ 1,  2,  4],
       [ 7,  8, 10]])
To select both rows and columns you have to pay attention to index broadcasting:
In [327]: A = np.arange(24).reshape(4,6)
In [328]: A[[[1],[3]], [1,2,4]]   # row index (as a column vector) and column index
Out[328]:
array([[ 7,  8, 10],
       [19, 20, 22]])
In [329]: A[np.ix_([1,3], [1,2,4])] # easier with ix_()
Out[329]:
array([[ 7,  8, 10],
       [19, 20, 22]])
https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#purely-integer-array-indexing
The index arrays/lists can be assigned to variables - the input to the A indexing can be a tuple.
In [330]: idx = [[1,3],[1,2,4]]
In [331]: idx1 = np.ix_(*idx)
In [332]: idx1
Out[332]:
(array([[1],
        [3]]), array([[1, 2, 4]]))
In [333]: A[idx1]
Out[333]:
array([[ 7,  8, 10],
       [19, 20, 22]])
And to expand a set of slices and indices into a single array, np.r_ is handy (though not magical):
In [335]: np.r_[slice(0,5),7,6, 3:6]
Out[335]: array([0, 1, 2, 3, 4, 7, 6, 3, 4, 5])
There are other indexing tools, utilities in indexing_tricks, functions like np.delete and np.take.
Try np.source(np.delete) to see how that handles general purpose deletion.
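As a quick illustration of those two helpers (a sketch on a throwaway array):

```python
import numpy as np

A = np.arange(12).reshape(2, 6)

# np.take pulls out columns 1, 2 and 4; equivalent to A[:, [1, 2, 4]].
picked = np.take(A, [1, 2, 4], axis=1)
print(picked)

# np.delete returns a copy with those same columns removed.
dropped = np.delete(A, [1, 2, 4], axis=1)
print(dropped)
```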
You could use a double list comprehension
>>> def select(arr, rows, cols):
... return [[el for j, el in enumerate(row) if j in cols] for i, row in enumerate(arr) if i in rows]
...
>>> select([[1,2,3,4],[5,6,7,8],[9,10,11,12]],(0,2),(1,3))
[[2, 4], [10, 12]]
>>>
Please note that, independent of the order of indices in rows and cols, select doesn't reorder the rows and columns of the input. Note also that using the same index repeatedly in either rows or cols does not give you duplicated rows or columns. Finally, note that select works only for lists of lists.
That said, I'd advise you to use numpy instead, which is hugely more flexible and far more efficient.
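For comparison, a numpy version of select is essentially a one-liner around np.ix_ (note that, unlike the list-comprehension version, this one does honor the order of the given indices and allows repeats):

```python
import numpy as np

def select_np(arr, rows, cols):
    # np.ix_ builds an open mesh so the row and column indices broadcast.
    return np.asarray(arr)[np.ix_(rows, cols)]

print(select_np([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], (0, 2), (1, 3)))
# [[ 2  4]
#  [10 12]]
```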

Sort array based on frequency

How can I sort an array by its most repeated values?
Suppose I have an array [3, 3, 3, 3, 4, 4].
I expect the result to be [3, 4], since 3 is the most repeated value and 4 the least.
Is there any way to do it?
Thanks in advance!
Here is one way of doing it:
distinctList: the distinct values from the array
countArray: for each index i, countArray[i] holds the number of occurrences of distinctList[i]
Now sort countArray in descending order and apply the same swaps to distinctList simultaneously.
Ex: [3, 3, 4, 4, 4]
distinctList: [3, 4]
countArray: [2, 3]
After the descending sort, countArray is [3, 2] and distinctList is [4, 3].
Output: [4, 3]
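Those steps can be sketched in Python, using a dict in place of the two parallel arrays:

```python
data = [3, 3, 3, 3, 4, 4]

# Build distinct values with their occurrence counts in one pass.
counts = {}
for x in data:
    counts[x] = counts.get(x, 0) + 1

# Distinct values sorted by descending count.
result = sorted(counts, key=lambda v: counts[v], reverse=True)
print(result)  # [3, 4]
```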
Simple in Python:
data = [3, 2, 3, 4, 2, 1, 3]
frequencies = {x: 0 for x in data}
for x in data:
    frequencies[x] = frequencies[x] + 1
sorted_with_repetitions = sorted(data, key=lambda x: frequencies[x], reverse=True)
sorted_without_repetitions = sorted(frequencies.keys(), key=lambda x: frequencies[x], reverse=True)
print(data)
print(sorted_with_repetitions)
print(sorted_without_repetitions)
print(frequencies)
The same approach (an associative container to collect distinct values and count occurrences, used in a custom comparison to sort an array with the original data or only distinct items) is suitable for Java.
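Since the rest of this page is numpy-flavored, it's worth noting the same result drops out of np.unique with return_counts=True:

```python
import numpy as np

data = np.array([3, 3, 3, 3, 4, 4])

# Distinct values and how often each occurs.
values, counts = np.unique(data, return_counts=True)

# Reorder the distinct values from most to least frequent.
by_freq = values[np.argsort(-counts)]
print(by_freq)  # [3 4]
```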

numpy using multidimensional index array on another multidimensional array

I have 2 multidimensional arrays, and I'd like to use one as the index to produce a new multidimensional array. For example:
a = array([[4, 3, 2, 5],
           [7, 8, 6, 8],
           [3, 1, 5, 6]])
b = array([[0, 2], [1, 1], [3, 1]])
I want to use the first row of b to pick out the elements at those indices in the first row of a, and so on. So I want the output to be:
array([[4,2],[8,8],[6,1]])
This is probably simple but I couldn't find an answer by searching. Thanks.
This is a little tricky, but the following will do it:
>>> a[np.arange(3)[:, np.newaxis], b]
array([[4, 2],
       [8, 8],
       [6, 1]])
You need to index both the rows and the columns of the a array, so to match your b array you would need an array like this:
rows = np.array([[0, 0],
                 [1, 1],
                 [2, 2]])
And then a[rows, b] would clearly return what you are after. You can get the same result relying on broadcasting as above, replacing the rows array with np.arange(3)[:, np.newaxis], which is equivalent to np.arange(3).reshape(3, 1).
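On NumPy 1.15+ there is also np.take_along_axis, which expresses the same per-row lookup without building a row index at all:

```python
import numpy as np

a = np.array([[4, 3, 2, 5],
              [7, 8, 6, 8],
              [3, 1, 5, 6]])
b = np.array([[0, 2], [1, 1], [3, 1]])

# For each row of a, pick the columns listed in the same row of b.
out = np.take_along_axis(a, b, axis=1)
print(out)
# [[4 2]
#  [8 8]
#  [6 1]]
```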

How to extract lines in an array, which contain a certain value? (numpy, scipy)

I have a numpy 2D array and I want it to return the value in column c for every row r where (r, c-1) (row r, column c-1) equals a certain value (int n).
I don't want to iterate over the rows writing something like
for r in range(len(rows)):
    if array[r, c-1] == 1:
        store array[r, c]
because there are 4000 rows, and this 2D array is just one of 20 I have to look through.
I found "filter" but don't know how to use it (I found no docs).
Is there a function that provides such a search?
I hope I understood your question correctly. Let's say you have an array a
a = np.array(list(range(7)) * 3).reshape(7, 3)
print(a)
[[0 1 2]
 [3 4 5]
 [6 0 1]
 [2 3 4]
 [5 6 0]
 [1 2 3]
 [4 5 6]]
and you want to extract all lines where the first entry is 2. This can be done like this:
print(a[a[:, 0] == 2])
[[2 3 4]]
a[:,0] denotes the first column of the array, == 2 returns a Boolean array marking the entries that match, and then we use advanced indexing to extract the respective rows.
Of course, NumPy needs to iterate over all entries, but this will be much faster than doing it in Python.
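Applied literally to the question (collect array[r, c] wherever array[r, c-1] equals n), the same boolean mask does it in one line; here c and n are just example values:

```python
import numpy as np

a = np.array([[0, 1, 2],
              [3, 4, 5],
              [6, 0, 1],
              [2, 3, 4],
              [5, 6, 0],
              [1, 2, 3],
              [4, 5, 6]])

c, n = 1, 2  # look at column c on rows where column c-1 equals n

# Values of column c on the matching rows.
vals = a[a[:, c - 1] == n, c]
print(vals)  # [3]
```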
Numpy arrays are not indexed. If you need to perform this specific operation more efficiently than linear in the array size, then you need to use something other than numpy.
