Plotting a list vs a list of arrays with matplotlib - arrays

Let's say I have two lists a and b, where b is a list of arrays:
a = [1200, 1400, 1600, 1800]
b = [array([ 1.84714754, 4.94204658, 11.61580355, ..., 17.09772144,
17.09537562, 17.09499705]), array([ 3.08541849, 5.11338795, 10.26957508, ..., 16.90633304,
16.90417909, 16.90458781]), array([ 4.61916789, 4.58351918, 4.37590053, ..., -2.76705271,
-2.46715664, -1.94577492]), array([7.11040853, 7.79529924, 8.48873734, ..., 7.78736448, 8.47749987,
9.36040364])]
The shape of both is reported as (4,).
If I now try to plot these via plt.scatter(a, b),
I get an error I can't make sense of: ValueError: setting an array element with a sequence.
In the end I want a plot where, for the n-th value in a, the set of values stored in the n-th array of b is plotted.
I'm pretty sure I've done this before, but I can't get this working.
Any ideas? ty

You need to repeat the elements of a so that they line up with the elements of b:
len_b = [len(sub_array) for sub_array in b]
a = [repeat_a for i,repeat_a in enumerate(a) for _ in range(len_b[i])]
# flatten the list of arrays into a single list of values
# (np.ravel only flattens here if all arrays in b have the same length)
b = np.ravel(b).tolist()
# check if lengths are same
assert len(a) == len(b)
# if yes, now this should work
plt.scatter(a,b)

I am afraid repetition it is. If all lists in b have the same length, you can use numpy.repeat:
import numpy as np
import matplotlib.pyplot as plt
#fake data
np.random.seed(123)
a = [1200, 1400, 1600, 1800]
b = np.random.randint(1, 100, (4, 11)).tolist()
plt.scatter(np.repeat(a, len(b[0])), b)
plt.show()
If you are not sure and want to be on the safe side, list comprehension it is.
import numpy as np
import matplotlib.pyplot as plt
#fake data
np.random.seed(123)
a = [1200, 1400, 1600, 1800]
b = np.random.randint(1, 100, (4, 11)).tolist()
# flatten both sides so that sublists of different lengths also work
plt.scatter([x for i, x in enumerate(a) for _ in b[i]], [y for row in b for y in row])
plt.show()
The output is the same.

Referring to the suggestion of @sai, I tried
import numpy as np
arr0 = np.array([1, 2, 3, 4, 5])
arr1 = np.array([6, 7, 8, 9])
arr2 = np.array([10, 11])
old_b = [arr0, arr1, arr2]
b = np.ravel(old_b).tolist()
print(len(b))
This gives me length 3 instead of the length 11 I expected. How can I collapse a list of arrays into a single list?
edit:
b = np.concatenate(old_b).ravel().tolist()
will lead to the desired result. Thanks all.
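Putting the pieces of this thread together, here is a minimal sketch for the case where the arrays in b have different lengths. The data below is made up for illustration; np.repeat is given one repeat count per element of a, and np.concatenate joins the arrays end to end.

```python
import numpy as np

# Hypothetical ragged data: one x value per array in b, arrays of unequal length.
a = [1200, 1400, 1600]
b = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0]), np.array([6.0])]

# Repeat each x value once per element of the matching array in b.
lengths = [len(arr) for arr in b]
x = np.repeat(a, lengths)

# Join the arrays end to end into a single 1-D array.
y = np.concatenate(b)

print(x.tolist())  # [1200, 1200, 1200, 1400, 1400, 1600]
print(y.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

x and y now have the same length, so they can be passed directly to plt.scatter(x, y).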

Related

Mapping an array to sort it in descending order on Matplotlib chart?

I am trying to build a bar chart with the bars shown in a descending order.
In my code, the numpy array is the result of using SelectKBest() to select the best features in a machine learning problem depending on their variance.
import numpy as np
import matplotlib.pyplot as plt
flist = ['int_rate', 'installment', 'log_annual_inc','dti', 'fico', 'days_with_cr_line', 'revol_bal', 'revol_util', 'inq_last_6mths','pub_rec']
fimportance = np.array([250.14120228,23.95686725,10.71979245,13.38566487,219.41737141,
8.19261323,27.69341779,64.96469182,218.77495366,22.7037686 ]) # this is the numpy.ndarray after running SelectKBest()
print(fimportance) # the 4 most important features are 'int_rate', 'fico', 'revol_util', 'inq_last_6mths'; the values map to flist by position, e.g. 250 belongs to 'int_rate' and 218 to 'inq_last_6mths'
[250.14120228 23.95686725 10.71979245 13.38566487 219.41737141
8.19261323 27.69341779 64.96469182 218.77495366 22.7037686 ]
So I want to show these values on my bar chart in descending order, with int_rate on top.
fimportance_sorted = np.sort(fimportance)[::-1] # np.sort is ascending; reverse it for descending order
fimportance_sorted
array([250.14120228, 219.41737141, 218.77495366, 64.96469182,
27.69341779, 23.95686725, 22.7037686 , 13.38566487,
10.71979245, 8.19261323])
# this bar chart is not right because here the values and indices are messed up.
plt.barh(flist, fimportance_sorted)
plt.show()
Next I have tried this.
plt.barh([x for x in range(len(fimportance))], fimportance)
I understand I need to map these indices to the flist values somehow and then sort them, maybe by creating an array and mapping my list labels to it instead of the indices. Here I am stuck.
for i,v in enumerate(fimportance):
    arr = np.array([i,v])
.....
Thank you for your help with this problem.
"the values and indices are messed up"
That's because you sorted fimportance (fimportance_sorted = np.sort(fimportance)), but the order of labels in flist remained unchanged, so now labels don't correspond to the values in fimportance_sorted.
You can use numpy.argsort to get the indices that would put fimportance into sorted order and then index both flist and fimportance with these indices:
>>> import numpy as np
>>> flist = ['int_rate', 'installment', 'log_annual_inc','dti', 'fico', 'days_with_cr_line', 'revol_bal', 'revol_util', 'inq_last_6mths','pub_rec']
>>> fimportance = np.array([250.14120228,23.95686725,10.71979245,13.38566487,219.41737141,
... 8.19261323,27.69341779,64.96469182,218.77495366,22.7037686 ])
>>> idx = np.argsort(fimportance)
>>> idx
array([5, 2, 3, 9, 1, 6, 7, 8, 4, 0])
>>> flist[idx]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: only integer scalar arrays can be converted to a scalar index
>>> np.array(flist)[idx]
array(['days_with_cr_line', 'log_annual_inc', 'dti', 'pub_rec',
'installment', 'revol_bal', 'revol_util', 'inq_last_6mths', 'fico',
'int_rate'], dtype='<U17')
>>> fimportance[idx]
array([ 8.19261323, 10.71979245, 13.38566487, 22.7037686 ,
23.95686725, 27.69341779, 64.96469182, 218.77495366,
219.41737141, 250.14120228])
idx is the order in which you need to put elements of fimportance to sort it. The order of flist must match the order of fimportance, so index both with idx.
As a result, elements of np.array(flist)[idx] correspond to elements of fimportance[idx].
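To connect this back to the bar chart, here is a small sketch that builds the matching label and value sequences from the same idx. The names labels_sorted and values_sorted are introduced here for illustration only.

```python
import numpy as np

flist = ['int_rate', 'installment', 'log_annual_inc', 'dti', 'fico',
         'days_with_cr_line', 'revol_bal', 'revol_util', 'inq_last_6mths', 'pub_rec']
fimportance = np.array([250.14120228, 23.95686725, 10.71979245, 13.38566487, 219.41737141,
                        8.19261323, 27.69341779, 64.96469182, 218.77495366, 22.7037686])

# Indices that sort the values in ascending order.
idx = np.argsort(fimportance)

# Apply the same order to labels and values so they stay paired.
labels_sorted = [flist[i] for i in idx]
values_sorted = fimportance[idx]

print(labels_sorted[0], labels_sorted[-1])  # days_with_cr_line int_rate
```

Passing these to plt.barh(labels_sorted, values_sorted) should put int_rate at the top, since barh draws the first entry at the bottom by default; use idx[::-1] if you need the opposite order.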

Indexing 3D arrays with Numpy

I have an array in three dimensions (x, y, z) and an indexing vector. This vector has a size equal to the dimension x of the array. Its purpose is to select, for each x, one specific y and return the corresponding z values, i.e., the expected result has shape (x, z).
I wrote code that works as expected, but does anyone know if a Numpy function can replace the for loop and solve the problem more efficiently?
arr = np.random.rand(100,5,2)
result = np.random.rand(100,2)
id = [np.random.randint(0, 5) for _ in range(100)]
for i in range(100):
    result[i] = arr[i,id[i]]
You can achieve this with this piece of code:
import numpy as np
arr = np.random.randn(100, 5, 2)
ids = np.random.randint(0, 5, size=100)
res = arr[range(100), ids]
res.shape # (100, 2)
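As a quick sanity check, the sketch below compares the indexing expression with the loop from the question on random data. It uses np.arange instead of range (both work as the first index) and a seeded generator so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((100, 5, 2))
ids = rng.integers(0, 5, size=100)

# Loop version from the question, used as the reference result.
expected = np.empty((100, 2))
for i in range(100):
    expected[i] = arr[i, ids[i]]

# Integer-array indexing: row i is paired with column ids[i];
# the last axis is kept whole.
res = arr[np.arange(100), ids]

print(res.shape)                      # (100, 2)
print(np.array_equal(res, expected))  # True
```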

How to effectively generate an array of tuples using numpy [duplicate]

I would like to efficiently generate a numpy array of tuples whose size is the product of the sizes of each axis, using numpy.arange() and exclusively numpy functions. For example: the size of a_list below is max_i*max_j*max_k.
Moreover, the array that I would like to obtain for the example below looks like this: [(0,0,0), (0,0,1), ..., (0, 0, 14), (0, 1, 0), (0, 1, 1), ..., (9, 4, 14)]
a_list = list()
max_i = 10
max_j = 5
max_k = 15
for i in range(0, max_i):
    for j in range(0, max_j):
        for k in range(0, max_k):
            a_list.append((i, j, k))
The loop above, relying on a list and for loops, has complexity O(max_i*max_j*max_k). I would like to use a vectorized way to generate an equivalent array of tuples in numpy. Is it possible?
I like Divakar's solution in the comments better, but here's another.
What you're describing is a cartesian product. With some help from this post, you can achieve this as follows
import numpy as np
# Input
max_i, max_j, max_k = (10, 5, 15)
# Build sequence arrays 0, 1, ... N
arr_i = np.arange(0, max_i)
arr_j = np.arange(0, max_j)
arr_k = np.arange(0, max_k)
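# Build cartesian product of sequence arrays
# (indexing='ij' keeps i as the slowest-varying axis, matching the loop order in the question)
grid = np.meshgrid(arr_i, arr_j, arr_k, indexing='ij')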
cartprod = np.stack(grid, axis=-1).reshape(-1, 3)
# Convert to list of tuples
result = list(map(tuple, cartprod))
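For comparison, here is a compact alternative sketch based on np.indices, checked against itertools.product. This is one possible numpy-only formulation, not necessarily the one referred to in the comments.

```python
import itertools
import numpy as np

max_i, max_j, max_k = 10, 5, 15

# np.indices returns an array of shape (3, max_i, max_j, max_k);
# flattening each of the 3 grids and transposing gives one (i, j, k) row per cell.
cartprod = np.indices((max_i, max_j, max_k)).reshape(3, -1).T

# Reference result in the same order as the nested loops in the question.
expected = list(itertools.product(range(max_i), range(max_j), range(max_k)))

print(cartprod.shape)                                   # (750, 3)
print(list(map(tuple, cartprod.tolist())) == expected)  # True
```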

Find common elements in subarrays of arrays

I have two numpy arrays of shape arr1=(~140000, 3) and arr2=(~450000, 10). The first 3 elements of each row, for both the arrays, are coordinates (z,y,x). I want to find the rows of arr2 that have the same coordinates of arr1 (which can be considered a subgroup of arr2).
for example:
arr1 = [[1,2,3],[1,2,5],[1,7,8],[5,6,7]]
arr2 = [[1,2,3,7,66,4,3,44,8,9],[1,3,9,6,7,8,3,4,5,2],[1,5,8,68,7,8,13,4,53,2],[5,6,7,6,67,8,63,4,5,20], ...]
I want to find common coordinates (same first 3 elements):
list_arr = [[1,2,3,7,66,4,3,44,8,9], [5,6,7,6,67,8,63,4,5,20], ...]
At the moment I'm doing this double loop, which is extremely slow:
list_arr=[]
for i in arr1:
    for j in arr2:
        if i[0]==j[0] and i[1]==j[1] and i[2]==j[2]:
            list_arr.append(j)
I also tried to create (after the 1st loop) a subarray of arr2, filtering it on the value of i[0] (arr2_filt = [el for el in arr2 if el[0]==i[0]]). This speeds up the operation a bit, but it still remains really slow.
Can you help me with this?
Approach #1
Here's a vectorized one with views -
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()
a,b = view1D(arr1,arr2[:,:3])
out = arr2[np.in1d(b,a)]
Approach #2
Another with dimensionality-reduction for ints -
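d = np.maximum(arr2[:,:3].max(0),arr1.max(0)) + 1 # per-column range; the +1 keeps distinct rows from mapping to the same key (assumes non-negative ints)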
s = np.r_[1,d[:-1].cumprod()]
a,b = arr1.dot(s),arr2[:,:3].dot(s)
out = arr2[np.in1d(b,a)]
Improvement #1
We could use np.searchsorted to replace np.in1d for both of the approaches listed earlier -
unq_a = np.unique(a)
idx = np.searchsorted(unq_a,b)
idx[idx==len(unq_a)] = 0 # searchsorted can return len(unq_a); clamp it to a valid index
out = arr2[unq_a[idx] == b]
Improvement #2
For the last improvement on using np.searchsorted that also uses np.unique, we could use argsort instead -
sidx = a.argsort()
idx = np.searchsorted(a,b,sorter=sidx)
idx[idx==len(a)] = 0
out = arr2[a[sidx[idx]]==b]
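As a self-contained illustration of the dimensionality-reduction idea, the sketch below runs on the small example from the question. It swaps in np.ravel_multi_index to build the per-row keys and np.isin for the membership test; it assumes non-negative integer coordinates, and the names coords2, dims, key1 and key2 are introduced here.

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [1, 2, 5], [1, 7, 8], [5, 6, 7]])
arr2 = np.array([
    [1, 2, 3, 7, 66, 4, 3, 44, 8, 9],
    [1, 3, 9, 6, 7, 8, 3, 4, 5, 2],
    [1, 5, 8, 68, 7, 8, 13, 4, 53, 2],
    [5, 6, 7, 6, 67, 8, 63, 4, 5, 20],
])

coords2 = arr2[:, :3]

# Size of each coordinate axis: one more than the largest value seen.
# Assumption: all coordinates are non-negative integers.
dims = np.maximum(coords2.max(axis=0), arr1.max(axis=0)) + 1

# Collapse each (z, y, x) row into a single unique integer key.
key1 = np.ravel_multi_index(arr1.T, dims)
key2 = np.ravel_multi_index(coords2.T, dims)

# Keep the rows of arr2 whose key also occurs in arr1.
out = arr2[np.isin(key2, key1)]

print(out[:, :3].tolist())  # [[1, 2, 3], [5, 6, 7]]
```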
You can do it with the help of set
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2 = np.array([[7,8,9,11,14,34],[23,12,11,10,12,13],[1,2,3,4,5,6]])
# create array from arr2 with only first 3 columns
temp = [i[:3] for i in arr2]
aset = set([tuple(x) for x in arr])
bset = set([tuple(x) for x in temp])
np.array([x for x in aset & bset])
Output
array([[7, 8, 9],
[1, 2, 3]])
Edit
Use list comprehension
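# compare whole rows via the tuple set built above; `i[:3] in arr` would match on any single equal element
l = [list(i) for i in arr2 if tuple(i[:3]) in aset]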
print(l)
Output:
[[7, 8, 9, 11, 14, 34], [1, 2, 3, 4, 5, 6]]
For integers Divakar already gave an excellent answer. If you want to compare floats you have to consider e.g. the following:
>>> 1.+1e-15==1.
False
>>> 1.+1e-16==1.
True
If this behaviour could lead to problems in your code, I would recommend performing a nearest-neighbour search and checking whether the distances are within a specified threshold.
import numpy as np
from scipy import spatial
def get_indices_of_nearest_neighbours(arr1,arr2):
    tree=spatial.cKDTree(arr2[:,0:3])
    #You can check here if the distance is small enough and otherwise raise an error
    dist,ind=tree.query(arr1, k=1)
    return ind
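The threshold check mentioned above could look roughly like the sketch below. It builds the tree on arr1 and queries the coordinates of arr2, so the result is a mask over the rows of arr2. The sample data and the tolerance tol are made up for illustration, and the right tolerance depends on your data.

```python
import numpy as np
from scipy import spatial

# Hypothetical float data: the first 3 columns of arr2 are coordinates.
arr1 = np.array([[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]])
arr2 = np.array([
    [1.0 + 1e-12, 2.0, 3.0, 10.0],   # matches arr1[0] up to rounding noise
    [1.0, 3.0, 9.0, 20.0],           # no counterpart in arr1
    [5.0, 6.0, 7.0 - 1e-12, 30.0],   # matches arr1[1] up to rounding noise
])
tol = 1e-9  # assumed tolerance, chosen only for this example

tree = spatial.cKDTree(arr1)
# With distance_upper_bound, points without a neighbour inside the bound
# are reported with an infinite distance.
dist, ind = tree.query(arr2[:, :3], k=1, distance_upper_bound=tol)
matched = np.isfinite(dist)

print(arr2[matched][:, 3].tolist())  # [10.0, 30.0]
```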

Saving to an empty array of arrays from nested for-loop

I have an array of arrays filled with zeros, so this is the shape I want for the result.
I'm having trouble saving the nested for-loop to this array of arrays. In other words, I want to replace all of the zeros with what the last line calculates.
percent = []
for i in range(len(F300)):
    percent.append(np.zeros(lengths[i]))
for i in range(0,len(Name)):
    for j in range(0,lengths[i]):
        percent[i][j]=(j+1)/lengths[i]
The last line only saves the last j value for each i.
I'm getting:
percent = [[0,0,1],[0,1],[0,0,0,1]]
but I want:
percent = [[.3,.6,1],[.5,1],[.25,.5,.75,1]]
The problem with this code is that because it's in Python 2.7, the / operator is performing "classic" division. There are a couple different approaches to solve this in Python 2.7. One approach is to convert the numbers being divided into floating point numbers:
import numpy as np
lengths = [3, 2, 4] # Deduced values of lengths from your output.
percent = []
for i in range(3): # Deduced size of F300 from the length of percent.
    percent.append(np.zeros(lengths[i]))
for i in range(0, len(percent)):
    for j in range(0, lengths[i]):
        percent[i][j] = float(j + 1) / float(lengths[i])
Another approach would be to import division from the __future__ module. However, this import must appear at the top of the file, before any other statements.
from __future__ import division
import numpy as np
lengths = [3, 2, 4] # Deduced values of lengths from your output.
percent = []
for i in range(3): # Deduced size of F300 from the length of percent.
    percent.append(np.zeros(lengths[i]))
for i in range(0, len(percent)):
    for j in range(0, lengths[i]):
        percent[i][j] = (j + 1) / lengths[i]
The third approach, and the one I personally prefer, is to make good use of NumPy's built-in functions:
import numpy as np
lengths = [3, 2, 4] # Deduced values of lengths from your output.
percent = np.array([np.linspace(1.0 / float(l), 1.0, l) for l in lengths], dtype=object) # dtype=object because the rows have different lengths
All three approaches will produce a list (or in the last case, numpy.ndarray object) of numpy.ndarray objects with the following values:
[[0.33333333, 0.66666667, 1.], [0.5, 1.], [0.25, 0.5, 0.75, 1.]]
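For reference, the same values can also be produced without an explicit inner loop by dividing a range of integers by the row length. This is a minimal sketch that keeps a plain list of arrays; the float() call is there so the division is not truncated under Python 2.7 either.

```python
import numpy as np

lengths = [3, 2, 4]  # same deduced lengths as above

# For each row length n, build 1/n, 2/n, ..., n/n in one vectorized step.
percent = [np.arange(1, n + 1) / float(n) for n in lengths]

for row in percent:
    print(row.tolist())
```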

Resources