Saving to an empty array of arrays from nested for-loop - arrays

I have an array of arrays filled with zeros, so this is the shape I want for the result.
I'm having trouble saving the nested for-loop to this array of arrays. In other words, I want to replace all of the zeros with what the last line calculates.
percent = []
for i in range(len(F300)):
    percent.append(np.zeros(lengths[i]))
for i in range(0,len(Name)):
    for j in range(0,lengths[i]):
        percent[i][j]=(j+1)/lengths[i]
The last line only saves the last j value for each i.
I'm getting:
percent = [[0,0,1],[0,1],[0,0,0,1]]
but I want:
percent = [[.3,.6,1],[.5,1],[.25,.5,.75,1]]

The problem with this code is that, because it runs under Python 2.7, the / operator performs "classic" division: with two integer operands the result is truncated, so (j+1)/lengths[i] is 0 until j+1 equals lengths[i]. There are a few ways to solve this in Python 2.7. One approach is to convert the numbers being divided to floating point:
import numpy as np
lengths = [3, 2, 4] # Deduced values of lengths from your output.
percent = []
for i in range(3): # Deduced size of F300 from the length of percent.
    percent.append(np.zeros(lengths[i]))
for i in range(0, len(percent)):
    for j in range(0, lengths[i]):
        percent[i][j] = float(j + 1) / float(lengths[i])
Another approach is to import division from the __future__ module. This import must appear at the top of the file, before any other code (only the module docstring, comments, and other __future__ imports may precede it).
from __future__ import division
import numpy as np
lengths = [3, 2, 4] # Deduced values of lengths from your output.
percent = []
for i in range(3): # Deduced size of F300 from the length of percent.
    percent.append(np.zeros(lengths[i]))
for i in range(0, len(percent)):
    for j in range(0, lengths[i]):
        percent[i][j] = (j + 1) / lengths[i]
The third approach, and the one I personally prefer, is to make good use of NumPy's built-in functions:
import numpy as np
lengths = [3, 2, 4] # Deduced values of lengths from your output.
percent = np.array([np.linspace(1.0 / float(l), 1.0, l) for l in lengths], dtype=object) # dtype=object because the rows have different lengths
All three approaches will produce a list (or in the last case, numpy.ndarray object) of numpy.ndarray objects with the following values:
[[0.33333333, 0.66666667, 1.], [0.5, 1.], [0.25, 0.5, 0.75, 1.]]
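The same values can also be built without an explicit inner loop. This is a minimal sketch, assuming the same deduced lengths as above; the helper name cumulative_fractions is made up for illustration and is not part of NumPy:

```python
import numpy as np

def cumulative_fractions(lengths):
    """Return one array per length n holding 1/n, 2/n, ..., n/n."""
    # float(n) forces true division under Python 2.7 as well as Python 3.
    return [np.arange(1, n + 1) / float(n) for n in lengths]

lengths = [3, 2, 4]  # same deduced values as in the answer above
percent = cumulative_fractions(lengths)
for row in percent:
    print(row)
```

Returning a plain list of arrays sidesteps the question of how NumPy should store rows of different lengths.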

Related

Mapping an array to sort it in descending order on Matplotlib chart?

I am trying to build a bar chart with the bars shown in a descending order.
In my code, the numpy array is the result of using SelectKBest() to select the best features in a machine learning problem depending on their variance.
import numpy as np
import matplotlib.pyplot as plt
flist = ['int_rate', 'installment', 'log_annual_inc','dti', 'fico', 'days_with_cr_line', 'revol_bal', 'revol_util', 'inq_last_6mths','pub_rec']
fimportance = np.array([250.14120228,23.95686725,10.71979245,13.38566487,219.41737141,
8.19261323,27.69341779,64.96469182,218.77495366,22.7037686 ]) # this is the numpy.ndarray after running SelectKBest()
print(fimportance) # this gives me 'int_rate', 'fico', 'revol_util', 'inq_last_6mths' as the 4 most important features, as their variance values are mapped to flist, e.g. 250 relates to 'int_rate' and 218 relates to 'inq_last_6mths'.
[250.14120228 23.95686725 10.71979245 13.38566487 219.41737141
8.19261323 27.69341779 64.96469182 218.77495366 22.7037686 ]
So I want to show these values on my bar chart in descending order, with int_rate on top.
fimportance_sorted = np.sort(fimportance)
fimportance_sorted
array([250.14120228, 219.41737141, 218.77495366, 64.96469182,
27.69341779, 23.95686725, 22.7037686 , 13.38566487,
10.71979245, 8.19261323])
# this bar chart is not right because here the values and indices are messed up.
plt.barh(flist, fimportance_sorted)
plt.show()
Next I have tried this.
plt.barh([x for x in range(len(fimportance))], fimportance)
I understand I need to map these indices to the flist values somehow and then sort them, maybe by creating an array and mapping my list labels to it instead of the index. This is where I am stuck.
for i,v in enumerate(fimportance):
    arr = np.array([i,v])
.....
Thank you for your help with this problem.
"the values and indices are messed up"
That's because you sorted fimportance (fimportance_sorted = np.sort(fimportance)), but the order of labels in flist remained unchanged, so now labels don't correspond to the values in fimportance_sorted.
You can use numpy.argsort to get the indices that would put fimportance into sorted order and then index both flist and fimportance with these indices:
>>> import numpy as np
>>> flist = ['int_rate', 'installment', 'log_annual_inc','dti', 'fico', 'days_with_cr_line', 'revol_bal', 'revol_util', 'inq_last_6mths','pub_rec']
>>> fimportance = np.array([250.14120228,23.95686725,10.71979245,13.38566487,219.41737141,
... 8.19261323,27.69341779,64.96469182,218.77495366,22.7037686 ])
>>> idx = np.argsort(fimportance)
>>> idx
array([5, 2, 3, 9, 1, 6, 7, 8, 4, 0])
>>> flist[idx]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: only integer scalar arrays can be converted to a scalar index
>>> np.array(flist)[idx]
array(['days_with_cr_line', 'log_annual_inc', 'dti', 'pub_rec',
'installment', 'revol_bal', 'revol_util', 'inq_last_6mths', 'fico',
'int_rate'], dtype='<U17')
>>> fimportance[idx]
array([ 8.19261323, 10.71979245, 13.38566487, 22.7037686 ,
23.95686725, 27.69341779, 64.96469182, 218.77495366,
219.41737141, 250.14120228])
idx is the order in which you need to put elements of fimportance to sort it. The order of flist must match the order of fimportance, so index both with idx.
As a result, elements of np.array(flist)[idx] correspond to elements of fimportance[idx].
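Putting the pieces together for the chart, here is a minimal sketch that reuses the data from the question. The names labels_sorted and values_sorted are just illustrative, and the plotting calls are left as comments so the snippet runs without a display:

```python
import numpy as np

flist = ['int_rate', 'installment', 'log_annual_inc', 'dti', 'fico',
         'days_with_cr_line', 'revol_bal', 'revol_util', 'inq_last_6mths',
         'pub_rec']
fimportance = np.array([250.14120228, 23.95686725, 10.71979245, 13.38566487,
                        219.41737141, 8.19261323, 27.69341779, 64.96469182,
                        218.77495366, 22.7037686])

# argsort returns ascending order; barh draws the first bar at the bottom,
# so ascending order places the largest value at the top of the chart.
idx = np.argsort(fimportance)
labels_sorted = np.array(flist)[idx]
values_sorted = fimportance[idx]

print(labels_sorted[-1], values_sorted[-1])

# With matplotlib available, the chart would then be drawn with:
#   import matplotlib.pyplot as plt
#   plt.barh(labels_sorted, values_sorted)
#   plt.show()
```

If the largest bar should sit at the bottom instead, indexing with idx[::-1] reverses the order for both arrays at once.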

Plotting a list vs a list of arrays with matplotlib

Let's say I have two lists a and b, where b is a list of arrays:
a = [1200, 1400, 1600, 1800]
b = [array([ 1.84714754, 4.94204658, 11.61580355, ..., 17.09772144,
17.09537562, 17.09499705]), array([ 3.08541849, 5.11338795, 10.26957508, ..., 16.90633304,
16.90417909, 16.90458781]), array([ 4.61916789, 4.58351918, 4.37590053, ..., -2.76705271,
-2.46715664, -1.94577492]), array([7.11040853, 7.79529924, 8.48873734, ..., 7.78736448, 8.47749987,
9.36040364])]
The shape of both is said to be (4,)
If I now try to plot these via plt.scatter(a, b)
I get an error I can't relate to: ValueError: setting an array element with a sequence.
In the end I want a plot where, for the n-th value in a, the set of values stored in the n-th array of b is plotted.
I'm pretty sure I've done this before, but I can't get this working.
Any ideas? ty
You need to repeat each element of a so that it lines up with the elements of b:
len_b = [len(sub_array) for sub_array in b]
a = [repeat_a for i,repeat_a in enumerate(a) for _ in range(len_b[i])]
# convert list of arrays to a flat list of values (np.ravel only flattens fully when all sub-arrays have the same length)
b = np.ravel(b).tolist()
# check if lengths are same
assert len(a) == len(b)
# if yes, now this should work
plt.scatter(a,b)
I am afraid repetition it is. If all lists in b have the same length, you can use numpy.repeat:
import numpy as np
import matplotlib.pyplot as plt
#fake data
np.random.seed(123)
a = [1200, 1400, 1600, 1800]
b = np.random.randint(1, 100, (4, 11)).tolist()
plt.scatter(np.repeat(a, len(b[0])), b)
plt.show()
If you are not sure and want to be on the safe side, list comprehension it is.
import numpy as np
import matplotlib.pyplot as plt
#fake data
np.random.seed(123)
a = [1200, 1400, 1600, 1800]
b = np.random.randint(1, 100, (4, 11)).tolist()
plt.scatter([[x]*len(b[i]) for i, x in enumerate(a)], b)
plt.show()
The output is the same in both cases.
Referring to the suggestion of @sai I tried
import numpy as np
arr0 = np.array([1, 2, 3, 4, 5])
arr1 = np.array([6, 7, 8, 9])
arr2 = np.array([10, 11])
old_b = [arr0, arr1, arr2]
b = np.ravel(old_b).tolist()
print(len(b))
Which will give me length 3 instead of the length 11 I expected. How can I collapse a list of arrays to a single list?
edit:
b = np.concatenate(old_b).ravel().tolist()
will lead to the desired result. Thanks all.
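For sub-arrays of different lengths, the two steps discussed in this thread (repeating a and flattening b) can be sketched together as follows. The data here is a hypothetical stand-in, not the arrays from the question:

```python
import numpy as np

a = [1200, 1400, 1600, 1800]
# Hypothetical stand-in data: sub-arrays of different lengths.
b = [np.array([1.0, 2.0, 3.0]),
     np.array([4.0, 5.0]),
     np.array([6.0, 7.0, 8.0, 9.0]),
     np.array([10.0])]

lengths = [len(sub) for sub in b]
x = np.repeat(a, lengths)  # each a[i] repeated len(b[i]) times
y = np.concatenate(b)      # all sub-arrays joined into one 1-D array

print(x.shape, y.shape)
# x and y now have matching lengths and can be passed to plt.scatter(x, y).
```

np.repeat accepts a per-element repeat count, so no Python-level loop over a is needed.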

How to get a sub-shape of an array in Python?

Not sure the title is correct, but I have an array with shape (84,84,3) and I need to get a subset of this array with shape (84,84), excluding the third dimension.
How can I accomplish this with Python?
your_array[:,:,0]
This is called slicing. This particular example gets the first 'layer' of the array. This assumes your subshape is a single layer.
If you are using numpy arrays, using slices would be a standard way of doing it:
import numpy as np
n = 3 # or any other positive integer
a = np.empty((84, 84, n))
i = 0 # any index with 0 <= i < n
b = a[:, :, i]
print(b.shape)
I recommend you have a look at this.
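A few variations on the slice may be worth knowing. This is a small sketch on stand-in data; the variable names are illustrative only:

```python
import numpy as np

a = np.arange(84 * 84 * 3).reshape(84, 84, 3)  # stand-in data

first = a[:, :, 0]   # an integer index drops the last axis
same = a[..., 0]     # Ellipsis form of the same slice
kept = a[:, :, 0:1]  # a length-1 slice keeps the axis

print(first.shape, same.shape, kept.shape)

# Basic slicing returns a view, so writing to `first` writes through to `a`.
first[0, 0] = -1
print(a[0, 0, 0])
```

If an independent array is needed, calling .copy() on the slice avoids modifying the original by accident.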

Find common elements in subarrays of arrays

I have two numpy arrays of shape arr1=(~140000, 3) and arr2=(~450000, 10). The first 3 elements of each row, for both the arrays, are coordinates (z,y,x). I want to find the rows of arr2 that have the same coordinates of arr1 (which can be considered a subgroup of arr2).
for example:
arr1 = [[1,2,3],[1,2,5],[1,7,8],[5,6,7]]
arr2 = [[1,2,3,7,66,4,3,44,8,9],[1,3,9,6,7,8,3,4,5,2],[1,5,8,68,7,8,13,4,53,2],[5,6,7,6,67,8,63,4,5,20], ...]
I want to find common coordinates (same first 3 elements):
list_arr = [[1,2,3,7,66,4,3,44,8,9], [5,6,7,6,67,8,63,4,5,20], ...]
At the moment I'm doing this double loop, which is extremely slow:
list_arr=[]
for i in arr1:
    for j in arr2:
        if i[0]==j[0] and i[1]==j[1] and i[2]==j[2]:
            list_arr.append(j)
I also tried to create (after the 1st loop) a subarray of arr2, filtering it on the value of i[0] (arr2_filt = [el for el in arr2 if el[0]==i[0]]). This speeds up the operation a bit, but it is still really slow.
Can you help me with this?
Approach #1
Here's a vectorized one with views -
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()
a,b = view1D(arr1,arr2[:,:3])
out = arr2[np.in1d(b,a)]
Approach #2
Another with dimensionality-reduction for ints -
d = np.maximum(arr2[:,:3].max(0),arr1.max(0)) + 1 # each base must exceed the largest coordinate (assumes non-negative ints)
s = np.r_[1,d[:-1].cumprod()]
a,b = arr1.dot(s),arr2[:,:3].dot(s)
out = arr2[np.in1d(b,a)]
Improvement #1
We could use np.searchsorted to replace np.in1d for both of the approaches listed earlier -
unq_a = np.unique(a)
idx = np.searchsorted(unq_a,b)
idx[idx==len(unq_a)] = 0
out = arr2[unq_a[idx] == b]
Improvement #2
The previous improvement relies on np.unique alongside np.searchsorted; we could use argsort instead -
sidx = a.argsort()
idx = np.searchsorted(a,b,sorter=sidx)
idx[idx==len(a)] = 0
out = arr2[a[sidx[idx]]==b]
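As a sanity check of the dimensionality-reduction idea (Approach #2), here is a sketch on the small example from the question, compared against the double loop. It assumes non-negative integer coordinates whose combined range fits in a 64-bit integer; the names base, scale, key1 and key2 are illustrative, and np.isin is used in place of np.in1d:

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [1, 2, 5], [1, 7, 8], [5, 6, 7]])
arr2 = np.array([[1, 2, 3, 7, 66, 4, 3, 44, 8, 9],
                 [1, 3, 9, 6, 7, 8, 3, 4, 5, 2],
                 [1, 5, 8, 68, 7, 8, 13, 4, 53, 2],
                 [5, 6, 7, 6, 67, 8, 63, 4, 5, 20]])

# Encode each (z, y, x) triple as one integer in a mixed-radix system.
# The +1 makes every base strictly larger than the largest digit,
# so distinct triples cannot map to the same key.
base = np.maximum(arr2[:, :3].max(0), arr1.max(0)) + 1
scale = np.r_[1, base[:-1].cumprod()]
key1 = arr1.dot(scale)
key2 = arr2[:, :3].dot(scale)

fast = arr2[np.isin(key2, key1)]

# Reference result: the same comparison done row by row.
slow = np.array([j for j in arr2 for i in arr1 if (i == j[:3]).all()])

print(fast)
```

Comparing against the slow loop on a small sample like this is a cheap way to confirm the encoding before running it on the full arrays.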
You can do it with the help of set
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2 = np.array([[7,8,9,11,14,34],[23,12,11,10,12,13],[1,2,3,4,5,6]])
# create array from arr2 with only first 3 columns
temp = [i[:3] for i in arr2]
aset = set([tuple(x) for x in arr])
bset = set([tuple(x) for x in temp])
np.array([x for x in aset & bset])
Output
array([[7, 8, 9],
[1, 2, 3]])
Edit
Use list comprehension
l = [list(i) for i in arr2 if tuple(i[:3]) in aset] # compare whole rows; "i[:3] in arr" would match on any single shared element
print(l)
Output:
[[7, 8, 9, 11, 14, 34], [1, 2, 3, 4, 5, 6]]
For integers Divakar already gave an excellent answer. If you want to compare floats you have to consider e.g. the following:
1.+1e-15==1.
False
1.+1e-16==1.
True
If this behaviour could lead to problems in your code, I would recommend performing a nearest-neighbour search and checking whether the distances are within a specified threshold.
import numpy as np
from scipy import spatial
def get_indices_of_nearest_neighbours(arr1,arr2):
    tree=spatial.cKDTree(arr2[:,0:3])
    #You can check here if the distance is small enough and otherwise raise an error
    dist,ind=tree.query(arr1, k=1)
    return ind
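The threshold check mentioned in the comment could look roughly like this. It is a sketch only: the function name matching_rows and the default tolerance of 1e-9 are assumptions for illustration and should be adapted to the precision of the actual coordinates:

```python
import numpy as np
from scipy import spatial

def matching_rows(arr1, arr2, tol=1e-9):
    """Rows of arr2 whose first three columns lie within tol of a row of arr1."""
    tree = spatial.cKDTree(arr2[:, 0:3])
    # With distance_upper_bound set, a query point that has no neighbour
    # within the bound is reported with an infinite distance.
    dist, ind = tree.query(arr1, k=1, distance_upper_bound=tol)
    found = np.isfinite(dist)
    return arr2[ind[found]]

# 0.1 + 0.2 is not exactly 0.3 in floating point, but it is within tol.
arr1 = np.array([[0.1 + 0.2, 1.0, 2.0], [5.0, 5.0, 5.0]])
arr2 = np.array([[0.3, 1.0, 2.0, 42.0], [9.0, 9.0, 9.0, 7.0]])
print(matching_rows(arr1, arr2))
```

Here rows without a close enough neighbour are silently dropped; raising an error instead, as the original comment suggests, would be a one-line change on the found mask.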

Looping through slices of Theano tensor

I have two 2D Theano tensors, call them x_1 and x_2, and suppose for the sake of example, both x_1 and x_2 have shape (1, 50). Now, to compute their mean squared error, I simply run:
T.sqr(x_1 - x_2).mean(axis = -1).
However, what I wanted to do was construct a new tensor that consists of their mean squared error in chunks of 10. In other words, since I'm more familiar with NumPy, what I had in mind was to create the following tensor M in Theano:
M = [theano.tensor.sqr(x_1[:, i:i+10] - x_2[:, i:i+10]).mean(axis = -1) for i in xrange(0, 50, 10)]
Now, since Theano doesn't have for loops, but instead uses scan (which map is a special case of), I thought I would try the following:
sequence = T.arange(0, 50, 10)
M = theano.map(lambda i: theano.tensor.sqr(x_1[:, i:i+10] - x_2[:, i:i+10]).mean(axis = -1), sequence)
However, this does not seem to work, as I get the error:
only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
Is there a way to loop through the slices using theano.scan (or map)? Thanks in advance, as I'm new to Theano!
Similar to what can be done in numpy, a solution would be to reshape your (1, 50) tensor to a (1, 5, 10) tensor (or even a (5, 10) tensor), so that each chunk of 10 lies along the last axis, and then to compute the mean along that last axis.
To illustrate this with numpy, suppose I want to compute means by slices of 2:
x = np.array([0, 2, 0, 4, 0, 6])
x = x.reshape([3, 2])
np.mean(x, axis=1)
outputs
array([ 1., 2., 3.])
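Applied to the shapes from the question, the reshape trick can be checked against the explicit loop over slices. This sketch uses NumPy with random stand-in data; the same reshape-then-mean pattern should carry over to Theano tensors, though that is not tested here:

```python
import numpy as np

rng = np.random.RandomState(0)
x_1 = rng.rand(1, 50)  # stand-in data with the shapes from the question
x_2 = rng.rand(1, 50)

chunk = 10
sq_err = (x_1 - x_2) ** 2
# (1, 50) -> (1, 5, 10): each row along the last axis is one chunk of 10.
M = sq_err.reshape(x_1.shape[0], -1, chunk).mean(axis=-1)

# Reference: the explicit loop over slices, as written in the question.
M_loop = np.stack([sq_err[:, i:i + chunk].mean(axis=-1)
                   for i in range(0, 50, chunk)], axis=1)

print(M.shape)
```

The reshape only works when the length of the last axis is a multiple of the chunk size, which holds for 50 and 10.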

Resources