Logistic Regression and Sparse matrices - AttributeError: 'bool' object has no attribute 'any' - sparse-matrix

I am trying to use logistic regression with sparse matrices because it may run faster. The problem is that I get errors and warnings that I do not understand. I will show you some code. Fair warning: I am new to this, so if you can pinpoint any unnecessary or bad code of mine, please say so.
My logic is the following (I will present code as well, in case the written description does not help):
1) Concatenate train_set and test_set into one set, to perform the preprocessing all at once (fill gaps, one-hot encoding, etc.) and to transform everything into sparse form
2) Then, after preprocessing, slice this set back into the two sets: one to train (to build the model) and one to test (which I want to predict)
3) To slice, though, I convert from COO to CSR, because slicing is not available for COO matrices
4) After slicing, I do the usual modelling steps, and then the problems occur
Time to show some code:
import pandas as pd
from scipy import sparse
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# read csv
train_set = pd.read_csv('train.csv', sep=',', nrows=10000, keep_default_na=True)
test_set = pd.read_csv('test.csv', sep=',', nrows=10000, keep_default_na=True)
# all_set includes both train & test data
all_set = pd.concat([train_set, test_set], sort=False)
# Pass the feature columns of all_set to X
X = all_set[all_set.columns]
X = X.drop(['id', 'target'], axis=1)
# Pass target values to Y and convert it to a sparse column vector
Y = train_set['target']
Y = sparse.csr_matrix(Y)
Y = csr_matrix.transpose(Y)
(after preprocessing)
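(The preprocessing itself is omitted here. A minimal sketch of what it could look like, assuming gap filling with fillna and one-hot encoding with pd.get_dummies; only the variable name new_Train is taken from the real code, the rest is illustrative:)
X = X.fillna('missing')  # fill gaps with a placeholder (illustrative; numeric columns may need another strategy)
X_encoded = pd.get_dummies(X)  # one-hot encode the categorical columns
new_Train = sparse.coo_matrix(X_encoded.values)  # everything in sparse (COO) form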
# Separate data into Train and Test with preprocessing complete
# first transform new_Train from coo to csr, because slicing is unavailable for coo
csr_Train = new_Train.tocsr()
final_train_set = csr_Train[0:len(train_set['target']), :]
final_test_set = csr_Train[len(train_set['target']):all_set.shape[0], :]
Y contains my target column to use for training, and final_train_set is my training data:
print("shape and type", final_train_set.shape, type(final_train_set))
print("shape and type", Y.shape, type(Y))
Results: (edit: even when both were coo, or both were csr, I got the same errors and warnings)
Seeing the matching shapes and types, all optimistic, I proceed to modelling.
X_train, X_test, y_train, y_test = train_test_split(final_train_set, Y, random_state=42, test_size=0.2)
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train, y_train)
The shape and type are the same. And here are the results...
C:\Users\kosta\Anaconda3\lib\site-packages\sklearn\utils\validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
C:\Users\kosta\Anaconda3\lib\site-packages\sklearn\utils\fixes.py:192: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
return X != X
C:\Users\kosta\Anaconda3\lib\site-packages\sklearn\utils\fixes.py:192: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
return X != X
Traceback (most recent call last):
File "C:/Users/kosta/PycharmProjects/cat_dat/Cat_Dat_v2.py", line 110, in <module>
lr.fit(X_train, y_train)
File "C:\Users\kosta\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py", line 1532, in fit
accept_large_sparse=solver != 'liblinear')
File "C:\Users\kosta\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 725, in check_X_y
_assert_all_finite(y)
File "C:\Users\kosta\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 59, in _assert_all_finite
if _object_dtype_isnan(X).any():
AttributeError: 'bool' object has no attribute 'any'
Process finished with exit code 1
To be honest, I don't understand what's wrong (neither the warnings nor the error), and I don't know how to proceed apart from the many trials and hours of researching on the net I have already done. So any help will do!
Thank you in advance for your time!

It seems the problem was with the target column. I should not have converted it to sparse form; I should have left it as a pandas Series. It seems that for the model to work, the two arguments do not need to be of the same type.
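A minimal sketch of the fix, reusing the variable names from the question: keep the features sparse, but pass the target as a plain 1-D Series (this also avoids the column-vector warning that ravel() would otherwise fix):
# Leave the target as a pandas Series instead of a sparse column vector
Y = train_set['target']
X_train, X_test, y_train, y_test = train_test_split(final_train_set, Y, random_state=42, test_size=0.2)
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train, y_train)  # sparse X with a dense 1-D y is fine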

Related

Structuring a for loop to output classifier predictions in python

I have an existing .py file that prints classifier.predict output for an SVC model. I would like to loop through each row in the X feature set to return a prediction.
I am currently trying to define the element to iterate over, so that the per-row test feature set can be defined.
The test feature set is written in code as:
X_1 = xspace.iloc[testval-1:testval, 0:5]
testval is the loop variable used in this for loop:
for testval in X.T.iterrows():
    print(testval)
I am having trouble returning a basic set of index values for X (X is the pandas DataFrame).
I have tested the following with no success:
for index in X.T.iterrows():
    print(index)
for index in X.T.iteritems():
    print(index)
I am looking for the set of index values, 1-based if possible, like 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ..., n.
Seemingly simple stuff... I haven't located an existing question via Stack Overflow or Google.
Also, the individual DataFrames I used as the basis for X were refined with the line:
df1.set_index('Date', inplace = True)
Because dates were used as the basis for concatenating the individual DataFrames, the loops as written above return date values rather than the positional values I would prefer, hence:
X_1 = xspace.iloc[testval-1:testval, 0:5]
where iloc (integer location) is used.
Please ask for additional code if you'd like to see more.
In short: the loops I've tried so far return date values; I would like the positional index values of the rows instead, to accommodate the line above.
The loop structure below seems to be working for my application.
j = list(range(1, len(X) + 1))  # 1-based positional indices
for testval in j:
    X_1 = xspace.iloc[testval - 1:testval, 0:5]  # one-row feature slice
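A slightly more idiomatic sketch of the same idea, assuming an already-fitted classifier named clf (an illustrative name, not from the original code): iterate over 0-based positions directly and slice with iloc:
for pos in range(len(xspace)):  # 0-based positional index
    row = xspace.iloc[pos:pos + 1, 0:5]  # one-row feature slice
    print(pos + 1, clf.predict(row))  # 1-based index and its prediction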

Build Dictionary of Arrays Efficiently in Julia

I want to save the (x, y) coordinates in a grid network that are visited by different individuals. Let's say I have 1000 individuals and the network size is x = 1:100 and y = 1:100. I am using a Dict() and here is some sample code showing what I want to do:
individuals = 1:1000
x = 1:100
y = 1:100
function Visited_nodes()
    nodes_of_inds = Dict{Int64, Array{Tuple{Int64, Int64}}}()
    for ind in individuals
        dum_array = Array{Tuple{Int64, Int64}}(0)
        for i in x
            for j in y
                if rand() < 0.2 # some conditions
                    push!(dum_array, (i, j))
                end
            end
        end
        nodes_of_inds[ind] = unique(dum_array)
    end
    return nodes_of_inds
end
@time nodes_of_inds = Visited_nodes()
# result: 1.742297 seconds (12.31 M allocations: 607.035 MB, 6.72% gc time)
But this is not efficient. I'd appreciate any advice on how to make it more efficient.
Please see the performance tips. Very first piece of advice there: avoid global variables. individuals, x, and y are all non-constant global variables. Make them arguments to your function instead. That change alone speeds up your function by an order of magnitude.
By construction, you're not going to have any duplicate tuples in your dum_array, so you don't need to call unique. That shaves off another factor of two.
Finally, Array{T} isn't a concrete type. Julia's arrays also encode the dimensionality as a type parameter, which must be included for the dictionary of arrays to be efficient. Use Array{T, 1} or Vector{T} instead. This isn't a major consideration within the time of this function, though.
The major thing that's left is just the O(length(individuals)*length(x)*length(y)) computational complexity. Doing anything ten million times will add up quickly, no matter how efficient it is.
@Matt B., thanks for your response. Regarding the global variables, I tried a simplified version of my code and it did not help the performance.
Let's say I read my input data from a couple of CSV files and I have three functions with different arguments:
function Read_input_data()
    # read input data
    individuals = readcsv("file1")
    x = readcsv("file2")
    y = readcsv("file3")
    A = readcsv("file4")
    B = readcsv("file5") # and a few other files
    # call different functions
    result_1 = Function1(individuals, x, y)
    result_2 = Function2(result_1, y, A, B)
    result_3 = Function3(result_2, individuals, A, B)
    return result_1, result_2, result_3
end
result_1, result_2, result_3 = Read_input_data()
I do not know why the performance is no better than when I define everything as a global! I'd appreciate any comments on this!

Subscript indices error while indexing array within symbolic function symsum

I'm trying to solve an easy recursive equation, but I've run into some very rudimentary problems that I think a MATLAB expert can easily resolve.
So here is the short version of my code:
clear all
%%%INPUT DATA
gconst = [75 75];
kconst = [200 200];
tau = [.01667 .14153];
%%% TIME Span
t = [0 .001 .002 .003 .004 .005];
%%% Definition of the functions g(x) and k(y)
syms g(x) k(y)
g(x) = gconst(1)*exp(-x/tau(1))+gconst(2)*exp(-x/tau(2));
k(y) = kconst(1)*exp(-y/tau(1))+kconst(2)*exp(-y/tau(2));
%%% Defining initial conditons
nu = zeros(1,7);
nu(1)= 3.64e-1;
nu(2)= 3.64e-1;
%%% nu(3) is required
int(sym('i'))
nu(3)=nu(1)*(3*k(t(3)-t(2))+g(t(3)-t(2))-g(t(3)))...
+symsum(nu(i)*(3*k(t(3)-t(i+1))-3*k(t(3)-t(i-1))... %symsum line 1
+g(t(3)-t(i+1))-g(t(3)-t(i-1))), i, 1, 3))... %symsum line 2
/(3*k(0)+g(0));
You can ignore the whole symsum part; even without it, the code still doesn't work.
It is a very straightforward code, but after running it, I get this error:
Subscript indices must either be real positive integers or logicals.
This error is found in the line where I defined nu(3).
I'd like to hear your comments.
EDIT 1: k(y) instead of k(x).
EDIT 2: zeros(1,7) instead of zeros(7).
NOTE 1: The code works without the symsum part and after EDIT 1.
What you want can't be done.
The reason is, that you are indexing an array t = [0 .001 .002 .003 .004 .005] with the symbolic summation index i.
So while
syms i
S1 = symsum(i, i, 1, 3)
works,
syms t i
t = [1 2 3];
S1 = symsum(t(i), i, 1, 3)
won't work, and there is no way around it, because t(i) is indexed before the values 1 ... 3 are substituted for i. You need to rethink your approach completely.
Apart from that, you probably want k(y) instead of k(x). That is why the code didn't work even without the symsum part.
Using i as a variable name is no longer an issue in MATLAB, but it is still best avoided to prevent misunderstandings.

How to write a random array (with no spatial reference) to geotiff format?

The following MATLAB script generates random locations within a 300x400 array and codes those locations with values from 1 to 12. How can I convert this non-spatial array to a GeoTIFF? I hope to use the GeoTIFF output to perform some trial analyses. Any projected coordinate system (e.g. UTM) would do for this analysis.
I have tried using geotiffwrite() without success using the following implementation:
out = geotiffwrite('C:\path\to\file\test.tif', m)
Which yields the following error:
>> test
Error using geotiffwrite
Too many output arguments.
EDIT:
The main problem I am encountering is a lack of inputs to the geotiffwrite() function. I am unsure how to deal with this problem. For example, I have no A or R variable, because the array has no spatial reference. As long as each pixel is georeferenced somewhere, I do not care what the spatial reference is. The purpose of this is to create a sample dataset that I can experiment with using MATLAB's spatial functions.
% Generate a totally black image to start with.
m = zeros(300, 400, 'uint8');
% Generate 1000 random locations.
numRandom = 1000;
linearIndices = randi(numel(m), 1, numRandom);
% Set those locations to random code values from 1 to 12.
m(linearIndices) = randi(12, [numel(linearIndices) 1]);
% Display it. Coded locations will appear slightly brighter than the black background.
image(m);
colormap(gray);
I believe your question has a very simple answer. Skip the out-variable when you call geotiffwrite. That is, use:
geotiffwrite('C:\path\to\file\test.tif', m)
Instead of
out = geotiffwrite('C:\path\to\file\test.tif', m)
This is an example of working code using geotiffwrite, taken from the documentation. As you can see, there is no output variable there:
basename = 'boston_ovr';
imagefile = [basename '.jpg'];
RGB = imread(imagefile);
worldfile = getworldfilename(imagefile);
R = worldfileread(worldfile, 'geographic', size(RGB));
filename = [basename '.tif'];
geotiffwrite(filename, RGB, R)
figure
usamap(RGB, R)
geoshow(filename)
Update:
According to the documentation, you need at least 3 input parameters. The valid syntaxes are:
geotiffwrite(filename,A,R)
geotiffwrite(filename,X,cmap,R)
geotiffwrite(...,Name,Value)
From the documentation:
geotiffwrite(filename,A,R) writes a georeferenced image or data grid, A, spatially referenced by R, into an output file, filename.
Please visit this link to see how to use the function.

Importing Data into Simulink Block Diagrams Issue

To all MATLAB and Simulink users,
I am doing a project and have faced a problem importing data through a 'Signal From Workspace' block in Simulink.
My case:
I need to input 565 rows of 2 columns of data, over a period of 22 seconds, into my Simulink block diagram. Each data sample time is 22/565.
However, the output signal is [565 x 2], which breaks the inputs of the downstream Simulink blocks due to a dimension mismatch.
For example, ideally a [1 x 2] output multiplies with a [2 x 1] vector, repeated 565 times over the 22 seconds. As it stands, the [565 x 2] output signal cannot get through because of its dimensions.
My attempts to solve the problem:
I tried using 'From Workspace' instead of 'Signal From Workspace' but ran into some problems:
t = [0:22/565:22]';
% M is a matrix with 565 rows and 2 columns of values
data.time = t;
data.signals.values = M;
data.signals.dimensions = [565 2];
This error pops up when the simulation is run:
"Invalid structure-format variable specified as workspace input in 'test/From Workspace'. The structure 'dimensions' field must be a scalar or a vector with 2 elements. In addition, this field must be compatible with the dimensions of input signal stored in the 'values' field."
I would greatly appreciate it if anybody could provide insight, solutions, or an alternative method for my case.
THANK YOU!
Regards,
KO
It looks like you should be using
data.signals.dimensions = 2;
For example
>> t= linspace(0,10,1001)';
>> data.time = t;
>> data.signals.values = [sin(t) cos(t)];
>> data.signals.dimensions = 2;
