Split an array into bins of equal numbers - arrays

I have an unsorted array of N elements. I'd like to keep the original order, but instead of the actual elements, I'd like each position to hold its bin number, where the sorted values are split into m bins of equal (if N is divisible by m) or nearly equal (if not) size. I need a vectorized solution, since N is fairly large and plain Python loops won't be efficient. Is there anything in scipy or numpy that can do this?
e.g.
N = [0.2, 1.5, 0.3, 1.7, 0.5]
m = 2
Desired output: [0, 1, 0, 1, 0]
I've looked at numpy.histogram, but it doesn't give me unequally spaced bins.

Listed in this post is a NumPy-based vectorized approach: the idea is to create equally spaced separators over the length of the input array and map each element's rank to a bin with np.searchsorted. Here's the implementation -
import numpy as np

def equal_bin(N, m):
    sep = (N.size / float(m)) * np.arange(1, m + 1)
    idx = sep.searchsorted(np.arange(N.size))
    return idx[N.argsort().argsort()]
Sample runs with bin-counting for each bin to verify results -
In [442]: N = np.arange(1,94)
In [443]: np.bincount(equal_bin(N, 4))
Out[443]: array([24, 23, 23, 23])
In [444]: np.bincount(equal_bin(N, 5))
Out[444]: array([19, 19, 18, 19, 18])
In [445]: np.bincount(equal_bin(N, 10))
Out[445]: array([10, 9, 9, 10, 9, 9, 10, 9, 9, 9])
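For the question's own example, equal_bin should reproduce the desired labels (a quick check, reusing the function above):
N = np.array([0.2, 1.5, 0.3, 1.7, 0.5])
print(equal_bin(N, 2))  # expected: [0 1 0 1 0]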
Here's another approach using linspace to create those equally spaced numbers that could be used as indices, like so -
def equal_bin_v2(N, m):
    # num must be an integer in current NumPy; endpoint=False keeps the values
    # in [0, m), so truncating to int yields bin ids 0..m-1
    idx = np.linspace(0, m, N.size, endpoint=False).astype(int)
    return idx[N.argsort().argsort()]
Sample run -
In [689]: N
Out[689]: array([ 0.2, 1.5, 0.3, 1.7, 0.5])
In [690]: equal_bin_v2(N,2)
Out[690]: array([0, 1, 0, 1, 0])
In [691]: equal_bin_v2(N,3)
Out[691]: array([0, 1, 0, 2, 1])
In [692]: equal_bin_v2(N,4)
Out[692]: array([0, 2, 0, 3, 1])
In [693]: equal_bin_v2(N,5)
Out[693]: array([0, 3, 1, 4, 2])

pandas.qcut
Another good alternative is pd.qcut from pandas. For example:
In [6]: import pandas as pd
In [7]: N = [0.2, 1.5, 0.3, 1.7, 0.5]
...: m = 2
In [8]: pd.qcut(N, m, labels=False)
Out[8]: array([0, 1, 0, 1, 0], dtype=int64)
Tip for getting the bin middle points
If you want the bin intervals instead of integer labels, leave labels at its default (None). This will allow you to get the bin middle points with:
In [26]: intervals = pd.qcut(N, 2)
In [27]: [i.mid for i in intervals]
Out[27]: [0.34950000000000003, 1.1, 0.34950000000000003, 1.1, 0.34950000000000003]
intervals is then an array of pandas.Interval objects.
See also: pd.cut, if you would like to make the bin width (not bin count) equal
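For contrast, here's a small sketch (made-up numbers, not from the question) showing how pd.cut splits by equal width while pd.qcut splits by equal count:
import pandas as pd

N = [0, 1, 2, 3, 100]
pd.cut(N, 2, labels=False)   # equal width: array([0, 0, 0, 0, 1])
pd.qcut(N, 2, labels=False)  # equal count: array([0, 0, 0, 1, 1])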


How to generate a numpy array with random values that are all different from each other

I've just started to learn about Python libraries and today I got stuck at a numpy exercise.
Let's say I want to generate a 5x5 random array whose values are all different from each other. Further, its values have to range from 0 to 100.
I have already looked this up but found no suitable solution to my problem. Please see the posts I consulted before turning to you:
Numpy: Get random set of rows from 2D array
Numpy Random 2D Array
Here is my attempt to solve the exercise:
import numpy as np
M = np.random.randint(1, 101, size=25)
for x in M:
    for y in M:
        if x in M == y in M:
            M = np.random.randint(1, 101, size=25)
print(M)
By doing this, all I get is a value error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Thus, my second attempt has been the following:
M = np.random.randint(1, 101, size=25)
a = M[x]
b = M[y]
for a.any in M:
    for b.any in M:
        if a.any in M == b.any in M:
            M = np.random.randint(1, 101, size=25)
print(M)
Once again, I got an error: AttributeError: 'numpy.int64' object attribute 'any' is read-only.
Could you please let me know what I am doing wrong? I've spent days going over this but nothing else comes to mind :(
Thank you so much!
Not sure if this will be ok for all your needs, but it will work for your example:
np.random.choice(np.arange(100, dtype=np.int32), size=(5, 5), replace=False)
You can use
np.random.random((5,5))
to generate an array of random numbers from 0 to 1, with shape (5,5).
Then just multiply by 100 to get the numbers between 0 and 100:
100*np.random.random((5,5))
Your code gives an error because of this line:
if x in M == y in M:
That syntax doesn't work: Python chains the comparisons, so it evaluates roughly as (x in M) and (M == y) and (y in M). The M == y part compares an array with a number elementwise, which is why you get ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
You already defined x and y as elements of M in the for loops so just write
if x == y:
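For completeness, here's a corrected brute-force version of the original idea (a sketch; note the inner loop starts at i + 1, since comparing every element with itself would always report a duplicate):
import numpy as np

def has_duplicates(M):
    for i in range(M.size):
        for j in range(i + 1, M.size):  # skip j == i, which always matches
            if M[i] == M[j]:
                return True
    return False

M = np.random.randint(1, 101, size=25)
while has_duplicates(M):
    M = np.random.randint(1, 101, size=25)
print(M)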
While I think choice without replacement is the easy way to go, here's a way implementing your approach. I use np.unique to test whether all the numbers are different.
In [158]: M = np.random.randint(1,101, size=25)
In [159]: x = np.unique(M)
In [160]: x.shape
Out[160]: (23,)
In [161]: cnt = 1
     ...: while x.shape[0]<25:
     ...:     M = np.random.randint(1,101,size=25)
     ...:     x = np.unique(M)
     ...:     cnt += 1
     ...:
In [162]: cnt
Out[162]: 34
In [163]: M
Out[163]:
array([ 70,  76,  27,  98,  81,  92,  97,  38,  49,   7,   2,  55,  85,
        89,  32,  51,  20, 100,   9,  91,  53,   3,  11,  63,  21])
Note that I had to generate 34 random sets before I got a unique one. That is a lot of work!
unique works by sorting the array, and then looking for adjacent duplicates.
A NumPy whole-array approach to testing whether there are any duplicates is to compare every element against every other. For the unique set above:
In [165]: (M[:,None]==M).sum(axis=0)
Out[165]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
but for another randint array:
In [166]: M1 = np.random.randint(1,101, size=25)
In [168]: (M1[:,None]==M1).sum(axis=0)
Out[168]:
array([1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1])
This M[:,None]==M is roughly the equivalent of
for x in M:
    for y in M:
        test x==y
I'm using sum to count how many True values there are in each column; any would just tell you whether there is one.
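If you only need a yes/no answer, a one-liner along those lines (a sketch) is:
has_dup = ((M1[:, None] == M1).sum(axis=0) > 1).any()  # True if any value repeats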

Replace elements of numpy array based on first occurrence of a particular value

Suppose there's a numpy 2D array as follows:
>>> x = np.array([[4,2,3,1,1], [1,0,3,2,1], [1,4,4,3,4]])
>>> x
array([[4, 2, 3, 1, 1],
       [1, 0, 3, 2, 1],
       [1, 4, 4, 3, 4]])
My objective is to - find the first occurrence of value 4 in each row, and set the rest of the elements (including that element) in that row to 0. Hence, after this operation, the transformed array should look like:
>>> x_new
array([[0, 0, 0, 0, 0],
       [1, 0, 3, 2, 1],
       [1, 0, 0, 0, 0]])
What is the pythonic and optimized way of doing this? I tried with a combination of np.argmax() and np.take() but was unable to achieve the end objective.
You can do it using a cumulative sum across the columns (i.e. axis=1) and boolean indexing:
n = 4
idx = np.cumsum(x == n, axis=1) > 0
x[idx] = 0
or maybe a better way is to do a cumulative (logical) or:
idx = np.logical_or.accumulate(x == n, axis=1)
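Putting it together, a minimal self-contained sketch (using np.where instead of in-place assignment, so the original array is preserved; x_new is the question's desired result):
import numpy as np

x = np.array([[4, 2, 3, 1, 1], [1, 0, 3, 2, 1], [1, 4, 4, 3, 4]])
n = 4
mask = np.logical_or.accumulate(x == n, axis=1)  # True from the first n onward
x_new = np.where(mask, 0, x)
print(x_new)
# [[0 0 0 0 0]
#  [1 0 3 2 1]
#  [1 0 0 0 0]]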

Fastest way to (arg)sort a flattened nD-array that is sorted along each dimension?

The question itself is language-agnostic. I will use python for my example, mainly because I think it is nice to demonstrate the point.
I have an N-dimensional array of shape (n1, n2, ..., nN) that is contiguous in memory (c-order) and filled with numbers. For each dimension by itself, the numbers are ordered in ascending order. A 2D example of such an array is:
>>> import numpy as np
>>> n1 = np.arange(5)[:, None]
>>> n2 = np.arange(7)[None, :]
>>> n1+n2
array([[ 0,  1,  2,  3,  4,  5,  6],
       [ 1,  2,  3,  4,  5,  6,  7],
       [ 2,  3,  4,  5,  6,  7,  8],
       [ 3,  4,  5,  6,  7,  8,  9],
       [ 4,  5,  6,  7,  8,  9, 10]])
In this case, the values in each row are ascending, and the values in each column are ascending, too. A 1D example array is
>>> n1 = np.arange(10)
>>> n1*n1
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81])
I would like to obtain a list/array containing the indices that would sort the flattened version of the nD array in ascending order. By the flattened array I mean that I interpret the nD-array as a 1D array of equivalent size. The sorting doesn't have to preserve order, i.e., the order of indices indexing equal numbers doesn't matter. For example
>>> n1 = np.arange(5)[:, None]
>>> n2 = np.arange(7)[None, :]
>>> arr = n1*n2
>>> arr
array([[ 0,  0,  0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4,  5,  6],
       [ 0,  2,  4,  6,  8, 10, 12],
       [ 0,  3,  6,  9, 12, 15, 18],
       [ 0,  4,  8, 12, 16, 20, 24]])
>>> np.argsort(arr.ravel())
array([ 0, 28, 14,  7,  6, 21,  4,  3,  2,  1,  5,  8,  9, 15, 22, 10, 11,
       29, 16, 12, 23, 17, 13, 18, 30, 24, 19, 25, 31, 20, 26, 32, 27, 33,
       34], dtype=int64)
Standard sorting on the flattened array can accomplish this; however, it doesn't exploit the fact that the array is already partially sorted, so I suspect there exists a more efficient solution. What is the most efficient way to do so?
A comment asked what my use-case is, and if I could provide some more realistic test data for benchmarking. Here is how I encountered this problem:
Given an image and a binary mask for that image (which selects pixels), find the largest sub-image which contains only selected pixels.
In my case, I applied a perspective transformation to an image, and want to crop it so that there is no black background while preserving as much of the image as possible.
import numpy as np
from skimage import data
from skimage import transform
from skimage import img_as_float

tform = transform.EuclideanTransform(
    rotation=np.pi / 12.,
    translation=(10, -10)
)
img = img_as_float(data.chelsea())[50:100, 150:200]
tf_img = transform.warp(img, tform.inverse)
tf_mask = transform.warp(np.ones_like(img), tform.inverse)[..., 0]
y = np.arange(tf_mask.shape[0])
x = np.arange(tf_mask.shape[1])
y1 = y[:, None, None, None]
y2 = y[None, None, :, None]
x1 = x[None, :, None, None]
x2 = x[None, None, None, :]
y_padded, x_padded = np.where(tf_mask==0.0)
y_padded = y_padded[None, None, None, None, :]
x_padded = x_padded[None, None, None, None, :]
y_inside = np.logical_and(y1[..., None] <= y_padded, y_padded<= y2[..., None])
x_inside = np.logical_and(x1[..., None] <= x_padded, x_padded<= x2[..., None])
contains_padding = np.any(np.logical_and(y_inside, x_inside), axis=-1)
# size of the sub-image
height = np.clip(y2 - y1 + 1, 0, None)
width = np.clip(x2 - x1 + 1, 0, None)
img_size = width * height
# find all largest sub-images
img_size[contains_padding] = 0
y_low, x_low, y_high, x_high = np.where(img_size == np.max(img_size))
cropped_img = tf_img[y_low[0]:y_high[0]+1, x_low[0]:x_high[0]+1]
The algorithm is quite inefficient; I am aware. What is interesting for this question is img_size, which is a (50,50,50,50) 4D-array that is ordered as described above. Currently I do:
img_size[contains_padding] = 0
y_low, x_low, y_high, x_high = np.where(img_size == np.max(img_size))
but with a proper argsort algorithm (that I can interrupt early) this could potentially be made much better.
I would do it using parts of mergesort and a divide and conquer approach.
You start with the first two arrays.
[0, 1, 2, 3, 4, 5, 6],  // <- this
[1, 2, 3, 4, 5, 6, 7],  // <- this
...
Then you can merge them like this (Java-like syntax):
List<Integer> merged = new ArrayList<>();
List<Integer> firstRow = ...  // Same would work with arrays
List<Integer> secondRow = ...
int firstCnter = 0;
int secondCnter = 0;
while (firstCnter < firstRow.size() || secondCnter < secondRow.size()) {
    if (firstCnter == firstRow.size()) {
        // First row exhausted: unconditionally add from the second
        merged.add(secondRow.get(secondCnter++));
    } else if (secondCnter == secondRow.size()) {
        // Second row exhausted: unconditionally add from the first
        merged.add(firstRow.get(firstCnter++));
    } else {
        // Add the smaller of the two current values
        int firstValue = firstRow.get(firstCnter);
        int secondValue = secondRow.get(secondCnter);
        merged.add(Math.min(firstValue, secondValue));
        if (firstValue <= secondValue)
            firstCnter++;
        else
            secondCnter++;
    }
}
After that you can merge the next two rows, until you have:
[0,1,1,2,2,3,3,4,4,5,5,6,6,7]
[2,3,3,4,4,5,5,6,6,7,7,8,8,9]
[4,5,6,7,8,9,10] //Not merged.
Continue to merge again.
[0,1,1,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,8,8,9]
[4,5,6,7,8,9,10]
After that, the last merge:
[0,1,1,2,2,2,3,3,3,3,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,7,8,8,8,9,9,10]
I don't know the exact constants, but this is essentially a k-way merge, so for n total elements in k sorted rows it runs in O(n log k); it should be a viable solution.
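For Python readers: the standard library already provides exactly this streaming two-way merge as heapq.merge. A small illustration on the first two rows:
import heapq

row0 = [0, 1, 2, 3, 4, 5, 6]
row1 = [1, 2, 3, 4, 5, 6, 7]
print(list(heapq.merge(row0, row1)))
# [0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7]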
Another idea: use a min-heap holding just the current candidates for the next-smallest value. Start with the value at the origin (index 0 in all dimensions), since that's the smallest. Then repeatedly pop the smallest value from the heap and push its not-yet-seen neighbors (one step forward along each axis).
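Here is a sketch of that heap idea in Python, returning flat indices in ascending order of value. It relies on the stated property that every cell's predecessor along each axis is no larger than the cell itself, so the next-smallest element is always already in the heap:
import heapq
import numpy as np

def argsort_grid(arr):
    shape = arr.shape
    flat = arr.ravel()
    heap = [(flat[0], 0)]  # index 0 in all dimensions holds the minimum
    seen = {0}
    order = []
    while heap:
        _, idx = heapq.heappop(heap)
        order.append(idx)
        multi = list(np.unravel_index(idx, shape))
        for axis in range(len(shape)):  # push the next cell along each axis
            if multi[axis] + 1 < shape[axis]:
                multi[axis] += 1
                nxt = int(np.ravel_multi_index(multi, shape))
                multi[axis] -= 1
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (flat[nxt], nxt))
    return np.array(order)
Each element is pushed and popped exactly once, and the heap only ever holds the current frontier of candidates, so it can also be stopped early once enough indices have been produced.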

Selecting numpy array elements

I have the task of selecting p% of elements within a given numpy array. For example,
# Initialize 5 x 3 array-
x = np.random.randint(low = -10, high = 10, size = (5, 3))
x
'''
array([[-4, -8,  3],
       [-9, -1,  5],
       [ 9,  1,  1],
       [-1, -1, -5],
       [-1, -4, -1]])
'''
Now, I want to select say p = 30% of the numbers in x, so 30% of numbers in x is 5 (rounded up).
Is there a way to select these 30% of numbers in x? Where p can change and the dimensionality of numpy array x can be 3-D or maybe more.
I am using Python 3.7 and numpy 1.18.1
Thanks
You can use np.random.choice to sample without replacement from a 1d numpy array:
p = 0.3
np.random.choice(x.flatten(), int(x.size * p), replace=False)
For large arrays, the performance of sampling without replacement can be pretty bad, but there are some workarounds.
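If you need the count rounded up, as in the question's example (30% of 15 elements -> 5), a small variation (a sketch) is to use math.ceil, since int() truncates:
import math
import numpy as np

x = np.random.randint(low=-10, high=10, size=(5, 3))
p = 0.3
k = math.ceil(x.size * p)  # 15 * 0.3 = 4.5 -> 5
sample = np.random.choice(x.ravel(), k, replace=False)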
You can randomly choose 0 or 1 with the desired probabilities, then use np.nonzero and boolean indexing:
np.random.seed(1)
x[np.nonzero(np.random.choice([1, 0], size=x.shape, p=[0.3,0.7]))]
Output:
array([ 3, -1, 5, 9, -1, -1])
I found a way of selecting (approximately) p% of the elements, namely those with the smallest values:
p = 20
x_abs = np.abs(x)  # assuming x_abs holds the elementwise absolute values of x

# To select the p% of elements with the smallest absolute value-
x_abs[x_abs < np.percentile(x_abs, p)]

# To select those elements and set them to a value (in this case, zero)-
x_abs[x_abs < np.percentile(x_abs, p)] = 0
Note that this selects by value threshold (the smallest p%), not a random p%.

How to deal with circle degrees in Numpy?

I need to calculate some direction arrays in numpy. I divided 360 degrees into 16 groups, each covering 22.5 degrees, with 0 degrees in the middle of a group, i.e., the first group spans -11.25 to 11.25 degrees. But how can I get the group between 168.75 degrees and -168.75 degrees, the one that wraps around ±180?
a[numpy.where(a<0)] = a[numpy.where(a<0)]+360
for m in range(0, 3600, 225):
    b = (a*10 > m)-(a*10 >= m+225).astype(float)
    c = numpy.apply_over_axes(numpy.sum, b, 0)
If you want to divide data into 16 groups, having 0 degree in the middle, why are you writing for m in range (0,3600,225)?
>>> [x/10. for x in range(0,3600,225)]
[0.0, 22.5, 45.0, 67.5, 90.0, 112.5, 135.0, 157.5, 180.0, 202.5, 225.0, 247.5,
270.0, 292.5, 315.0, 337.5]
## these sectors are not the ones you want!
I would say you should start with for m in range (-1125,36000,2250) (note that now I am using a 100 factor instead of 10), that would give you the groups you want...
wind_sectors = [x/100.0 for x in range(-1125, 36000, 2250)]
for m in wind_sectors:
    # DO THINGS
I have to say I don't really understand your script and the goal of it...
To deal with circle degrees, I would suggest something like:
a condition for the problematic data, i.e., the sector where you have to handle the transition around zero;
a condition for all the other data.
For example, in this case, I am printing all the elements from my array that belong to each sector:
import numpy

def wind_sectors(a_array, nsect=16):
    step = 360./nsect
    init = step/2
    sectores = [x/100.0 for x in range(int(init*100), 36000, int(step*100))]
    a_array[a_array < 0] = a_array[a_array < 0] + 360
    for i, m in enumerate(sectores):
        print('Sector' + str(i) + ' (max_threshold = ' + str(m) + ')')
        if i == 0:
            # first sector wraps around zero: below the first threshold
            # or above the last one
            for b in a_array:
                if b <= m or b > sectores[-1]:
                    print(b)
        else:
            for b in a_array:
                if b <= m and b > sectores[i-1]:
                    print(b)
    return "it works!"

# TESTING IF THE FUNCTION IS WORKING:
a = numpy.array([2, 67, 89, 3, 245, 359, 46, 342])
print(wind_sectors(a, 16))

# WITH NDARRAYS:
b = numpy.array([[250, 31, 27, 306], [142, 54, 260, 179], [86, 93, 109, 311]])
print(wind_sectors(b.flat[:], 16))
About the flat and reshape functions:
>>> a = numpy.array([[0,1,2,3], [4,5,6,7], [8,9,10,11]])
>>> original = a.shape
>>> b = a.flat[:]
>>> c = b.reshape(original)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> b
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
>>> c
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
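As a fully vectorized alternative for the wrap-around sector (a sketch, not part of the answer above): shift every angle by half a sector width, wrap with modulo 360, then integer-divide by the sector width, so the sector centered on 0 degrees comes out as index 0:
import numpy as np

def sector_indices(angles_deg, nsect=16):
    width = 360.0 / nsect  # 22.5 degrees for 16 sectors
    shifted = (np.asarray(angles_deg) + width / 2) % 360
    return (shifted // width).astype(int)

a = np.array([2, 67, 89, 3, 245, 359, 46, 342])
print(sector_indices(a))  # 359 and 2 both land in sector 0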
