Python ValueError: setting an array element with a sequence while using SVM in scikit-learn

I have been working with scikit-learn SVMs on a binary classification problem. I calculated features for a set of images and stored them in an array. This is what each row of the array looks like:
[variable(0.16749821603298187) variable(0.15862827003002167)
variable(0.15818320214748383) ..., variable(0.2765314280986786)
variable(0.2909393608570099) variable(0.2909393608570099)]
The shape of X_train_svm is (6, 7290) and the shape of Y_train is (6,).
Printing X_train_svm and Y_train shows the expected values. But when I run
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
classifier=SVC(kernel='linear',random_state=0)
classifier.fit(X_train_svm,Y_train)
I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-145-a957b86fe2dc> in <module>
2 from sklearn.metrics import accuracy_score
3 classifier=SVC(kernel='linear',random_state=0)
----> 4 classifier.fit(X_train_svm,Y_train)
c:\users\s121293.squ\appdata\local\programs\python\python35\lib\site-packages\numpy\core\numeric.py in asarray(a, dtype, order)
480
481 """
--> 482 return array(a, dtype, copy=False, order=order)
483
484 def asanyarray(a, dtype=None, order=None):
ValueError: setting an array element with a sequence.
Can someone help me understand what is happening inside? The first dimensions of X_train_svm and Y_train are the same.
Note: I have a feeling that something goes wrong when I convert the objects to a numpy array. Thanks in advance.
Edit: X_train_svm looks like the following:
[[variable(0.16749821603298187) variable(0.15862827003002167)
variable(0.15818320214748383) ..., variable(0.2765314280986786)
variable(0.2909393608570099) variable(0.2909393608570099)]
..............................................................
[variable(0.22378747165203094) variable(0.22378747165203094)
variable(0.20569562911987305) ..., variable(0.29241225123405457)
variable(0.31552478671073914) variable(0.31552478671073914)]]
Y_train holds the labels:
[0 0 0 1 1 1]
I used the following code to pass the features of the fully-connected layer to the SVM classifier:
X_train_SVM = Fc1_output
print(Y_train)
print(X_train_SVM.shape)
Y_train_svm = np.reshape(Y_train, (6, 1))
####### SVM ######################
clf = SVC(gamma=0.01, C=10, kernel='poly')
clf.fit(X_train_SVM, Y_train_svm)
All my images are the same size, i.e. resized to 224x224.
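The variable(...) entries show that X_train_svm is an object array of autograd Variable wrappers rather than plain floats, which is exactly what asarray cannot convert in one step and what triggers the ValueError. A minimal sketch of a fix, assuming Fc1_output is a PyTorch tensor (if it comes from another framework, the unwrapping call will differ):
import numpy as np
from sklearn.svm import SVC

# Strip the autograd wrapper and get a plain float array.
# For very old PyTorch Variables, Fc1_output.data.numpy() is the equivalent.
X_train_np = Fc1_output.detach().cpu().numpy().astype(np.float64)

clf = SVC(kernel='linear', random_state=0)
clf.fit(X_train_np, Y_train)  # keep Y_train 1-D with shape (6,), no reshape to (6, 1)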

Related

Mapping an array to sort it in descending order on Matplotlib chart?

I am trying to build a bar chart with the bars shown in descending order.
In my code, the numpy array is the result of using SelectKBest() to select the best features in a machine learning problem according to their variance.
import numpy as np
import matplotlib.pyplot as plt
flist = ['int_rate', 'installment', 'log_annual_inc','dti', 'fico', 'days_with_cr_line', 'revol_bal', 'revol_util', 'inq_last_6mths','pub_rec']
fimportance = np.array([250.14120228,23.95686725,10.71979245,13.38566487,219.41737141,
8.19261323,27.69341779,64.96469182,218.77495366,22.7037686 ]) # this is the numpy.ndarray after running SelectKBest()
print(fimportance)
# this gives me 'int_rate', 'fico', 'revol_util', 'inq_last_6mths' as the 4 most
# important features, as the variance values map to flist by position,
# e.g. 250 relates to 'int_rate' and 218 relates to 'inq_last_6mths'.
[250.14120228 23.95686725 10.71979245 13.38566487 219.41737141
8.19261323 27.69341779 64.96469182 218.77495366 22.7037686 ]
So I want to show these values on my bar chart in descending order, with int_rate on top.
fimportance_sorted = np.sort(fimportance)
fimportance_sorted
array([250.14120228, 219.41737141, 218.77495366, 64.96469182,
27.69341779, 23.95686725, 22.7037686 , 13.38566487,
10.71979245, 8.19261323])
# this bar chart is not right because here the values and indices are messed up.
plt.barh(flist, fimportance_sorted)
plt.show()
Next I tried this:
plt.barh([x for x in range(len(fimportance))], fimportance)
I understand I need to map these indices to the flist values somehow and then sort them, maybe by creating an array and mapping my list labels instead of their indices. Here I am stuck:
for i, v in enumerate(fimportance):
    arr = np.array([i, v])
    .....
Thank you for your help with this problem.
the values and indices are messed up
That's because you sorted fimportance (fimportance_sorted = np.sort(fimportance)), but the order of labels in flist remained unchanged, so the labels no longer correspond to the values in fimportance_sorted.
You can use numpy.argsort to get the indices that would put fimportance into sorted order and then index both flist and fimportance with these indices:
>>> import numpy as np
>>> flist = ['int_rate', 'installment', 'log_annual_inc','dti', 'fico', 'days_with_cr_line', 'revol_bal', 'revol_util', 'inq_last_6mths','pub_rec']
>>> fimportance = np.array([250.14120228,23.95686725,10.71979245,13.38566487,219.41737141,
... 8.19261323,27.69341779,64.96469182,218.77495366,22.7037686 ])
>>> idx = np.argsort(fimportance)
>>> idx
array([5, 2, 3, 9, 1, 6, 7, 8, 4, 0])
>>> flist[idx]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: only integer scalar arrays can be converted to a scalar index
>>> np.array(flist)[idx]
array(['days_with_cr_line', 'log_annual_inc', 'dti', 'pub_rec',
'installment', 'revol_bal', 'revol_util', 'inq_last_6mths', 'fico',
'int_rate'], dtype='<U17')
>>> fimportance[idx]
array([ 8.19261323, 10.71979245, 13.38566487, 22.7037686 ,
23.95686725, 27.69341779, 64.96469182, 218.77495366,
219.41737141, 250.14120228])
idx is the order in which you need to put elements of fimportance to sort it. The order of flist must match the order of fimportance, so index both with idx.
As a result, elements of np.array(flist)[idx] correspond to elements of fimportance[idx].
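Putting it together, a minimal sketch of the sorted chart (note that barh draws the first element at the bottom, so indexing in ascending order leaves the largest bar, int_rate, on top):
import numpy as np
import matplotlib.pyplot as plt

flist = ['int_rate', 'installment', 'log_annual_inc', 'dti', 'fico',
         'days_with_cr_line', 'revol_bal', 'revol_util', 'inq_last_6mths', 'pub_rec']
fimportance = np.array([250.14120228, 23.95686725, 10.71979245, 13.38566487, 219.41737141,
                        8.19261323, 27.69341779, 64.96469182, 218.77495366, 22.7037686])

idx = np.argsort(fimportance)                     # ascending order of importance
plt.barh(np.array(flist)[idx], fimportance[idx])  # largest bar drawn last, i.e. on top
plt.show()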

np.asarray error: could not broadcast input array from shape (2,2) into shape (2)

I am experimenting with influence functions to understand black-box models. I am encountering a broadcast error while working with a toy dataset of 2 features and 2 classes. Below, I have summarized the actual error using two lists, a1 and a2.
a1 = [array([[-0.00491985, 0.00491965],
[-0.00334969, 0.00334955],
[-0.00136081, 0.00136076]], dtype=float32),
array([-0.00104678, 0.00104674], dtype=float32)]
a2 =
[array([[-0.00334969, 0.00334955],
[-0.00136081, 0.00136076]], dtype=float32),
array([-0.00104678, 0.00104674], dtype=float32)]
I am trying to convert the above two lists into arrays using np.asarray()
print(np.asarray(a1))
array([array([[-0.00491985, 0.00491965],
[-0.00334969, 0.00334955],
[-0.00136081, 0.00136076]], dtype=float32),
array([-0.00104678, 0.00104674], dtype=float32)], dtype=object)
While np.asarray(a1) works fine, np.asarray(a2) throws the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-51-3060768e9016> in <module>()
----> 1 np.asarray(a2)
/home/devi/.local/lib/python3.5/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: could not broadcast input array from shape (2,2) into shape (2)
I went through many forums describing broadcasting errors but still could not figure out how np.asarray() behaves here.
When the elements of the list are arrays of shapes (3,2) and (2,), np.asarray() returns an object array of length 2. But when the elements have shapes (2,2) and (2,), why does it throw an error instead of returning an array of length 2 as in the previous case? Any help understanding this will be greatly appreciated!
First you need to reshape all the arrays to have the same number of dimensions, and then convert the list to a numpy array:
a2 = [a.reshape(-1, 2) for a in a2]
a2 = np.array(a2)
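As to why a1 converts and a2 fails: when the leading dimensions differ (3 vs. 2), NumPy gives up on a numeric result and falls back to a 1-D object array, but in a2 both elements have a leading dimension of 2, so NumPy commits to a (2, ...) numeric array and then cannot broadcast shape (2,2) into shape (2,). If an object array of the two ragged sub-arrays is all that is needed (which is what np.asarray(a1) happened to produce), a minimal sketch that sidesteps the shape guessing entirely:
import numpy as np

# Pre-allocate an object array and fill it element by element, so NumPy
# never tries to combine the differently shaped sub-arrays.
a2_arr = np.empty(len(a2), dtype=object)
for i, a in enumerate(a2):
    a2_arr[i] = a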

Find edge points of numpy array for kmeans centroids initialization

I am working on implementing a kmeans algorithm in python.
I am testing out new ways of initializing my centroids and wanted to implement them and see what effect they would have on the clustering.
My idea is to select data points from my data set so that the centroids are initialized to edge points of my data.
A simple 2-attribute example:
Let's say this is my input array:
import numpy as np
input = np.array([[3,3], [1,1], [-1,-1], [3,-3], [-1,1], [-3,3], [1,-1], [-3,-3]])
From this array I would like to select the edge points, which would be [3,3], [-3,-3], [-3,3] and [3,-3]. So if my k is 4, these points would be selected.
The data sets I am working with have 4 and 9 attributes and around 300 data points each.
Note: I have not found a solution for when k differs from the number of edge points, but if k is greater than the number of edge points I think I would select these 4 points and then try to place the rest around the center point of the graph.
I have also thought about finding the max and min of each column and from there trying to find the edges of my data set, but I don't have an effective way of identifying the edges from those values.
If you believe this idea will not work I would love to hear what you have to say.
Questions
Does numpy have a function to get the indices of data points on the edge of my data set?
If not, how would I go about finding these edge points in my data set?
Use scipy and pairwise distances to find how far each point is from every other point:
from scipy.spatial.distance import pdist, squareform
p=pdist(input)
Then use squareform to turn the vector p into a matrix:
s=squareform(pdist(input))
Then use numpy's argwhere to find the indices where the distance is maximal, and look up those indices in the input array:
input[np.argwhere(s==np.max(p))]
array([[[ 3, 3],
[-3, -3]],
[[ 3, -3],
[-3, 3]],
[[-3, 3],
[ 3, -3]],
[[-3, -3],
[ 3, 3]]])
The complete code would be:
import numpy as np
from scipy.spatial.distance import pdist, squareform

p = pdist(input)
s = squareform(p)
input[np.argwhere(s == np.max(p))]
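This finds only the single farthest pair (reported four times because the distance matrix is symmetric). To pick k well-spread initial centroids, one common heuristic, not part of the answer above, is a greedy farthest-first traversal; a sketch:
import numpy as np
from scipy.spatial.distance import pdist, squareform

def farthest_points(X, k):
    # Start at one end of the overall farthest pair, then repeatedly add
    # the point whose distance to the already chosen points is largest.
    d = squareform(pdist(X))
    chosen = [int(np.unravel_index(np.argmax(d), d.shape)[0])]
    while len(chosen) < k:
        nearest = d[:, chosen].min(axis=1)  # distance to the closest chosen point
        chosen.append(int(np.argmax(nearest)))
    return np.asarray(chosen)

X = np.array([[3, 3], [1, 1], [-1, -1], [3, -3],
              [-1, 1], [-3, 3], [1, -1], [-3, -3]])
print(X[farthest_points(X, 4)])  # picks the four corner points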

matplotlib.pyplot errorbar ValueError depends on array length?

Good afternoon.
I've been struggling with this for a while now, and although I can find similar problems online, nothing I found could really help me resolve it.
Starting with a standard data file (.csv or .txt, I tried both) containing three columns (x, y and the error of y), I want to read in the data and generate a line plot including error bars.
I can plot the x and y values without a problem, but if I want to add errorbars using the matplotlib.pyplot errorbar utility, I get the following error message:
ValueError: yerr must be a scalar, the same dimensions as y, or 2xN.
The code below works if I use some arbitrary arrays (numpy or plain python), but not for data read from the file. I've tried converting the tuples which I obtain from my input code to numpy arrays using asarray, but to no avail.
import numpy as np
import matplotlib.pyplot as plt
row = []
with open("data.csv") as data:
    for line in data:
        row.append(line.split(','))
column = zip(*row)
x = column[0]
y = column[1]
yer = column[2]
plt.figure()
plt.errorbar(x, y, yerr=yer)
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
fig.savefig('example.png', dpi=300)
It must be that I am overlooking something. I would be very grateful for any thoughts on the matter.
yerr should be the error added to and subtracted from the y value. In your case the added error equals the subtracted error, each being half of the third column.
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt('data.csv', delimiter=',')
plt.figure()
yerr_ = np.tile(data[:, 2]/2, (2, 1))
plt.errorbar(data[:, 0], data[:, 1], yerr=yerr_)
plt.xlim([-1, 3])
plt.show()
data.csv
0,2,0.3
1,4,0.4
2,3,0.15
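For reference, the original manual parsing can also be made to work. A likely culprit is that line.split(',') returns strings, so nothing is numeric by the time it reaches errorbar; a minimal sketch under that assumption:
import matplotlib.pyplot as plt

rows = []
with open("data.csv") as data:
    for line in data:
        rows.append([float(v) for v in line.split(',')])  # convert strings to floats
x, y, yer = zip(*rows)

plt.figure()
plt.errorbar(x, y, yerr=yer)
plt.show()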

Python 2.7: looping over 1D fibers in a multidimensional Numpy array

I am looking for a way to loop over 1D fibers (row, column, and multi-dimensional equivalents) along any dimension in a 3+-dimensional array.
In a 2D array this is fairly trivial since the fibers are rows and columns, so just saying for row in A gets the job done. But for 3D arrays for example, this expression iterates over 2D slices, not 1D fibers.
A working solution is the one below:
import numpy as np
A = np.arange(27).reshape((3,3,3))
func = np.sum
for fiber_index in np.ndindex(A.shape[:-1]):
    print func(A[fiber_index])
However, I am wondering whether there is something that is:
More idiomatic
Faster
Hope you can help!
I think you might be looking for numpy.apply_along_axis
In [10]: def my_func(x):
    ...:     return x**2 + x
In [11]: np.apply_along_axis(my_func, 2, A)
Out[11]:
array([[[ 0, 2, 6],
[ 12, 20, 30],
[ 42, 56, 72]],
[[ 90, 110, 132],
[156, 182, 210],
[240, 272, 306]],
[[342, 380, 420],
[462, 506, 552],
[600, 650, 702]]])
Although many NumPy functions (including sum) have their own axis argument to specify which axis to use:
In [12]: np.sum(A, axis=2)
Out[12]:
array([[ 3, 12, 21],
[30, 39, 48],
[57, 66, 75]])
numpy provides a number of different ways of looping over 1 or more dimensions.
Your example:
for fiber_index in np.ndindex(A.shape[:-1]):
    print fiber_index
    print A[fiber_index]
produces something like:
(0, 0)
[0 1 2]
(0, 1)
[3 4 5]
(0, 2)
[6 7 8]
...
ndindex generates all index combinations over the first two dimensions, giving your function the 1D fiber along the last.
Look at the code for ndindex; it's instructive. I tried to extract its essence in https://stackoverflow.com/a/25097271/901925.
It uses as_strided to generate a dummy matrix over which an nditer iterates. It uses the 'multi_index' mode to generate an index set, rather than the elements of that dummy. The iteration itself is done with a __next__ method. This is the same style of indexing that is currently used in numpy's compiled code.
The Iterating Over Arrays page (http://docs.scipy.org/doc/numpy-dev/reference/arrays.nditer.html) has a good explanation, including an example of doing so in Cython.
Many functions, among them sum, max, product, let you specify which axis (axes) you want to iterate over. Your example, with sum, can be written as:
np.sum(A, axis=-1)
np.sum(A, axis=(1,2)) # sum over 2 axes
An equivalent is
np.add.reduce(A, axis=-1)
np.add is a ufunc, and reduce specifies an iteration mode. There are many other ufuncs, and other iteration modes such as accumulate and reduceat. You can also define your own ufunc.
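A quick illustration of the reduce and accumulate modes (not from the original answer):
import numpy as np

A = np.arange(27).reshape((3, 3, 3))
print(np.add.reduce(A, axis=-1))      # same result as np.sum(A, axis=-1)
print(np.add.accumulate(A, axis=-1))  # running sums along each 1D fiber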
xnx suggests
np.apply_along_axis(np.sum, 2, A)
It's worth digging through apply_along_axis to see how it steps through the dimensions of A. In your example, it steps over all possible i,j in a while loop, calculating:
outarr[(i,j)] = np.sum(A[(i, j, slice(None))])
Including slice objects in the indexing tuple is a nice trick. Note that it edits a list, and then converts it to a tuple for indexing. That's because tuples are immutable.
Your iteration can be applied along any axis by rolling that axis to the end. This is a 'cheap' operation since it just changes the strides.
def with_ndindex(A, func, ax=-1):
    # apply func along axis ax
    A = np.rollaxis(A, ax, A.ndim)  # roll ax to end (changes strides)
    shape = A.shape[:-1]
    B = np.empty(shape, dtype=A.dtype)
    for ii in np.ndindex(shape):
        B[ii] = func(A[ii])
    return B
I did some timings on 3x3x3, 10x10x10 and 100x100x100 A arrays. This np.ndindex approach is consistently a third faster than the apply_along_axis approach. Direct use of np.sum(A, -1) is much faster.
So if func is limited to operating on a 1D fiber (unlike sum), then the ndindex approach is a good choice.
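For example, a usage sketch with a function that, unlike sum, has no axis argument of its own (the spread helper here is made up for illustration):
import numpy as np

A = np.arange(27).reshape((3, 3, 3))

def spread(v):
    # peak-to-peak range of a single 1D fiber
    return v.max() - v.min()

print(with_ndindex(A, spread))        # along the last axis: shape (3, 3), all 2s
print(with_ndindex(A, spread, ax=0))  # along axis 0: all 18s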
