Theano - logistic regression example weight vector becomes NaN?

I am following a tutorial (code here) and video here (13:00 minutes in).
My only change is loading the MNIST training set from a different location (and creating a one-hot encoding), but it is not working. I literally copy-pasted all the code in this example, except for the MNIST loading. Here is the code:
import theano
from theano import tensor as T
import numpy as np
from numpy import newaxis
from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def sparse_to_floatX(X):
    # densify the sparse one-hot matrix and cast to floatX
    return floatX(X.todense())

def init_weights(shape):
    return theano.shared(floatX(np.random.randn(*shape) * 0.1))

mnist = fetch_mldata("MNIST Original")
trX, teX, trY_digit, teY_digit = train_test_split(mnist.data, mnist.target, test_size=.4)

# Get one-hot encoding
enc = OneHotEncoder()
enc.fit([[n] for n in range(10)])
trY = sparse_to_floatX(enc.transform(trY_digit[:, newaxis]))
teY = sparse_to_floatX(enc.transform(teY_digit[:, newaxis]))
def model(X, w):
    return T.nnet.softmax(T.dot(X, w))
X = T.fmatrix()
Y = T.fmatrix()
w = init_weights((784, 10))
py_x = model(X, w)
y_pred = T.argmax(py_x, axis=1)
cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))
gradient = T.grad(cost=cost, wrt=w)
update = [[w, w - gradient * 0.05]]
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)
for i in range(10):
    print w.get_value()
    cost = train(trX, trY)
    print i, predict(teX)
The weight vector updates once, then becomes all NaN on the second update. I am very new to Theano, but I am looking for tips to figure this out, especially from anyone who has already done this tutorial.
UPDATE.
It looks like the gradient is the issue.
When I add this:
the_grad = T.sum(gradient)
f_grad = theano.function(inputs=[X, Y], outputs=the_grad, allow_input_downcast=True)
print f_grad(trX, trY)
it prints NaN. This appears to be the correct usage of T.grad, though.
UPDATE 2.
When I change the cost function to this:
cost = T.mean(T.sum(T.sqr(py_x - Y), axis=1), axis=0)
It works now, but I only get 70% accuracy, which is really bad.
UPDATE 3.
I downloaded the MNIST data used in the tutorial and it worked, with 92% accuracy.
I am not sure why my first MNIST data source was producing NaNs with the cross-entropy cost and then performing really poorly with the mean squared error cost function.
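My best guess (unconfirmed): fetch_mldata returns raw pixel values in [0, 255], while the tutorial's loader scales them to [0, 1]. Unscaled inputs can saturate the softmax so that py_x contains exact zeros, and log(0) in the cross-entropy then produces NaN. A minimal sketch of the fix, assuming that's the cause:
# Unconfirmed guess: scale pixel values to [0, 1] before training.
# fetch_mldata's MNIST is uint8 in [0, 255]; the tutorial's loader
# normalizes, which may explain the difference between the two sources.
trX = floatX(trX) / 255.
teX = floatX(teX) / 255.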

Related

Interpret a bootstrap output

I applied bootstrapping to a logistic regression model. If I understand correctly, the bias in the bootstrap output should help evaluate whether my logistic regression model is representative of the true population, right? I have presence-only data from two time points in a rather small study area (about 500 ha), and I want to test for an elevational upward shift of the species distribution since the earlier time point. To do this, I randomly selected pseudo-absences in the study area, so it seems sensible to evaluate the reliability of the model.
If bootstrapping is a good way to go for me, I am still stuck on the interpretation of the bootstrap output.
The output looks like this:
ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = comdat, statistic = boot.h, R = 999)

Bootstrap Statistics :
          original          bias    std. error
t1*  4.776178e+01   2.190671e+00  2.853876e+01
t2*  1.709549e-02   1.580777e-04  9.383476e-03
t3* -4.969446e-05  -8.607768e-07  2.099336e-05
t4* -9.436566e+01  -3.473776e+00  3.345420e+01
t5* -5.454165e-02  -2.403711e-03  3.085717e-02
t6*  1.488151e-05   6.536446e-07  8.284733e-06
t7*  1.024497e-01   3.803788e-03  3.604996e-02
t8* -2.725339e-05  -1.028595e-06  9.672357e-06
Questions:
Although some predictors turned out significant in the original model, I wonder whether that is meaningful, as the original coefficients are far below zero?
How can I judge whether the bias is big or not, i.e. whether my original model is okay or not?
Thanks in advance, and sorry if I didn't explain it well in general. That's possible, as I am very unsure about the bootstrapping and also about the reliability of my model (I included 150 randomly selected pseudo-absences).
boot.h <- function(data, indices) {
  data <- data[indices, ]
  mod <- glm(mound ~ aspect + I(aspect^2) + year
             + elevation + I(elevation^2)
             + elevation:year + year:I(elevation^2),
             family = binomial, data = data)
  coefficients(mod)
}
boot.k <- boot(data = comdat, statistic = boot.h, R = 999)
plot(boot.k, index = 2)
plot(boot.k, index = 3)
plot(boot.k, index = 4)
plot(boot.k, index = 5)
plot(boot.k, index = 6)
plot(boot.k, index = 7)
plot(boot.k, index = 8)
boot.conf.2 <- boot.ci(boot.out = boot.k, conf = 0.95,
                       type = c("norm", "perc", "bca"), index = 2)
boot.conf.3 <- boot.ci(boot.out = boot.k, conf = 0.95,
                       type = c("norm", "perc", "bca"), index = 3)
boot.conf.4 <- boot.ci(boot.out = boot.k, conf = 0.95,
                       type = c("norm", "perc", "bca"), index = 4)

Spatial interpolation with Gaussian process regression

I have a CSV file with 140,000 points (rows). It consists of:
a longitude value
a latitude value
a subsidence value at specific points
I assume that these points are spatially correlated.
I want to perform a spatial interpolation analysis of the area covered by the points, i.e. a geostatistical interpolation using, for example, kriging (Gaussian process regression).
I'm reading about Gaussian process regression on the scikit-learn page, but I'm unsure how to implement it.
What characteristics determine which kernel I can use? How do I implement this correctly with my spatial data?
First, you should convert your data to a projected coordinate system. The best one depends on where your data are located; essentially you want the conformal projection with the least distortion for your location (e.g. Mercator near the equator, or Transverse Mercator if your data are all close to a single meridian). You can achieve this in geopandas, for example:
import pandas as pd
import geopandas as gpd

data = {'latitude': [54, 56, 58], 'longitude': [-62, -63, -64], 'subsidence': [10, 20, 30]}
df = pd.DataFrame(data)
params = {
    'geometry': gpd.points_from_xy(df.longitude, df.latitude),
    'crs': 'epsg:4326',  # WGS84
}
gdf_ = gpd.GeoDataFrame(df, **params)
gdf = gdf_.to_crs('epsg:2961')  # UTM20N
gdf
This GeoDataFrame is now in projected coordinates. Now you can do some spatial prediction:
import numpy as np
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process import GaussianProcessRegressor
kernel = RBF(length_scale=100_000)
gpr = GaussianProcessRegressor(kernel=kernel)
X = np.array([gdf.geometry.x, gdf.geometry.y]).T
y = gdf.subsidence
gpr.fit(X, y)
Now you can predict at a location, e.g. gpr.predict([(500_000, 5_900_000)]) gives array([22.86764555]) for my toy data.
To predict on a grid, you could do this:
x_min, x_max = np.min(gdf.geometry.x) - 10_000, np.max(gdf.geometry.x) + 10_000
y_min, y_max = np.min(gdf.geometry.y) - 10_000, np.max(gdf.geometry.y) + 10_000
grid_y, grid_x = np.mgrid[y_min:y_max:10_000, x_min:x_max:10_000]
X_grid = np.stack([grid_x.ravel(), grid_y.ravel()]).T
y_grid = gpr.predict(X_grid).reshape(grid_x.shape)
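To sanity-check the result, you could plot the predicted surface over the observations (a sketch; the styling choices are arbitrary):
import matplotlib.pyplot as plt

# Predicted surface; origin='lower' so y increases upward like map coordinates.
plt.imshow(y_grid, origin='lower', extent=(x_min, x_max, y_min, y_max))
plt.colorbar(label='subsidence')
plt.scatter(gdf.geometry.x, gdf.geometry.y, c='k', s=10)  # observed points
plt.show()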
Things to think about:
You should read the docs for geopandas and sklearn.gaussian_process.
You should fit the kernel to your data (see the sketch after this list).
You might want to use an anisotropic kernel.
The estimator has a few hyperparameters which you should pay attention to.
Don't forget to do some validation of your estimates, check the distribution of the residuals, etc.
You might want to use a specialist geostats package like gstools, which will do a lot of the fiddly things for you.
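To illustrate fitting the kernel and making it anisotropic, here is a hedged sketch reusing X and y from above; the length scales, bounds, and noise level are placeholders, not recommendations:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# One length scale per coordinate (x, y) makes the kernel anisotropic;
# WhiteKernel lets the fit absorb observation noise. Starting values
# and bounds here are placeholders for your own data.
kernel = (RBF(length_scale=[50_000, 50_000], length_scale_bounds=(1_000, 1_000_000))
          + WhiteKernel(noise_level=1.0))
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(X, y)  # maximizes the log marginal likelihood over the kernel parameters
print(gpr.kernel_)  # the fitted kernel, with optimized length scales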

Plotting a 2D vector with separate component arrays

I feel like this is a pretty basic question, but I can't seem to get my head around it. I have a velocity vector V with two components in x and y that both depend on time: v_x(t) = sin(at) and v_y(t) = exp(bt).
I have created an array for t ranging from 0 to 100 with np.arange(0, 100, 1). I want to plot the resulting vector and its evolution with respect to t using matplotlib. How do I do that?
A simple way that you might try is the following:
import matplotlib.pyplot as plt
import numpy as np
t = np.arange(0, 100, 1)
a = 0.1
b = 0.05
vel = np.array([np.sin(a*t), np.exp(b*t)], float)
plt.plot(vel[0, :], vel[1, :])
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.show()
This gave me the plot.
The line vel = np.array([np.sin(a*t), np.exp(b*t)], float) does all the magic: np.sin(a*t) builds a new array by applying the sine elementwise to each value in t (and np.exp() works similarly).
It would also be possible (and fun) to make an animation of the evolution of the vector.
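For instance, a minimal sketch with matplotlib.animation, reusing t and vel from above (the frame count and interval are arbitrary choices):
from matplotlib.animation import FuncAnimation

fig, ax = plt.subplots()
ax.set_xlim(-1.5, 1.5)          # sin(at) stays in [-1, 1]
ax.set_ylim(0, vel[1].max())    # exp(bt) grows over time
# One arrow from the origin, updated each frame to the vector at time t.
q = ax.quiver([0], [0], [vel[0, 0]], [vel[1, 0]],
              angles='xy', scale_units='xy', scale=1)

def update(frame):
    q.set_UVC(vel[0, frame], vel[1, frame])
    return q,

anim = FuncAnimation(fig, update, frames=len(t), interval=50)
plt.show()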

Why is PyMC3 ADVI worse than MCMC in this logistic regression example?

I am aware of the mathematical differences between ADVI and MCMC, but I am trying to understand the practical implications of using one or the other. I am running a very simple logistic regression example on data I created this way:
import pandas as pd
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
def logistic(x, b, noise=None):
    L = x.T.dot(b)
    if noise is not None:
        L = L + noise
    return 1/(1 + np.exp(-L))
x1 = np.linspace(-10., 10, 10000)
x2 = np.linspace(0., 20, 10000)
bias = np.ones(len(x1))
X = np.vstack([x1,x2,bias]) # Add intercept
B = [-10., 2., 1.] # Sigmoid params for X + intercept
# Noisy mean
pnoisy = logistic(X, B, noise=np.random.normal(loc=0., scale=0., size=len(x1)))
# dichotomize pnoisy -- sample 0/1 with probability pnoisy
y = np.random.binomial(1., pnoisy)
And then I run ADVI like this:
with pm.Model() as model:
    # Define priors
    intercept = pm.Normal('Intercept', 0, sd=10)
    x1_coef = pm.Normal('x1', 0, sd=10)
    x2_coef = pm.Normal('x2', 0, sd=10)
    # Define likelihood
    likelihood = pm.Bernoulli('y',
                              pm.math.sigmoid(intercept + x1_coef*X[0] + x2_coef*X[1]),
                              observed=y)
    approx = pm.fit(90000, method='advi')
Unfortunately, no matter how much I increase the number of iterations, ADVI does not seem able to recover the original betas I defined, [-10., 2., 1.], while MCMC works fine (as shown below).
Thanks for the help!
This is an interesting question! The default 'advi' in PyMC3 is mean-field variational inference, which does not do a great job of capturing correlations. It turns out that the model you set up has an interesting correlation structure, which can be seen with this:
import arviz as az
az.plot_pair(trace, figsize=(5, 5))
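Here `trace` is assumed to come from a standard NUTS run on the same model, something like the following (a sketch; the original post doesn't show the sampling call):
with model:
    trace = pm.sample(2_000, tune=1_000)  # NUTS baseline used for comparison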
PyMC3 has a built-in convergence checker; running the optimization for too long or too short a time can lead to funny results:
from pymc3.variational.callbacks import CheckParametersConvergence

with model:
    fit = pm.fit(100_000, method='advi', callbacks=[CheckParametersConvergence()])
    draws = fit.sample(2_000)
This stops after about 60,000 iterations for me. Now we can inspect the correlations and see that, as expected, ADVI fits axis-aligned Gaussians:
az.plot_pair(draws, figsize=(5, 5))
Finally, we can compare the fit from NUTS and (mean field) ADVI:
az.plot_forest([draws, trace])
Note that ADVI underestimates the variance, but is fairly close for the mean of each parameter. Also, you can set method='fullrank_advi' to capture the correlations you are seeing a little better.
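A minimal sketch of that full-rank variant, with the same convergence callback as above:
with model:
    fit_fr = pm.fit(100_000, method='fullrank_advi',
                    callbacks=[CheckParametersConvergence()])
    draws_fr = fit_fr.sample(2_000)
az.plot_pair(draws_fr, figsize=(5, 5))  # off-diagonal structure should now appear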
(note: arviz is soon to be the plotting library for PyMC3)

dask.array.reshape very slow

I have an array that I iteratively build up as follows:
step1.shape = (200,200)
step2.shape = (200,200,200)
step3.shape = (200,200,200,200)
and then reshape to:
step4.shape = (200,200**3)
I do this because dask.array.atop doesn't seem to allow you to go directly from a shape like (200, 200) to (200, 200**2). I think this is related to chunking and lazy evaluation.
When I get to step4 and try to reshape the array, dask seems to want to compute the matrix prior to reshaping it, which results in significant computation time and memory use.
Is there a way to avoid this?
As requested, here is some dummy code:
import math
import numpy as np
import dask.array as da

def prod_mat(matrix_a, matrix_b):
    # matrix_a.shape = (300, ..., 300, 200)
    # matrix_b.shape = (300, 200)
    mat_a = matrix_a.reshape(-1, matrix_a.shape[-1])
    # mat_a.shape = (300**n, 200)
    mat_b = matrix_b.reshape(-1, matrix_b.shape[-1])
    # mat_b.shape = (300, 200)
    mat_temp = np.repeat(mat_a, matrix_b.shape[0], axis=0) * np.tile(mat_b.T, mat_a.shape[0]).T
    new_dim = int(math.log(mat_temp.shape[0]) / math.log(matrix_a.shape[0]))
    new_shape = [matrix_a.shape[0] for n in range(new_dim)]
    new_shape.append(-1)
    result = mat_temp.reshape(tuple(new_shape))
    # result.shape = (300, ..., 300, 300, 200)
    return result

b = np.random.rand(300, 200)
b = da.from_array(b, chunks=100)
c = da.atop(prod_mat, 'ijk', b, 'ik', b, 'jk')
d = da.atop(prod_mat, 'ijkl', c, 'ijl', b, 'kl')
e = da.atop(prod_mat, 'ijklm', d, 'ijkm', b, 'lm')
f = e.sum(axis=-1)
f.reshape(300, 300**3)  # ----> This is slow, as if it is using compute()
This computation isn't calling compute; instead, it's stuck building a very, very large graph. Generally speaking, reshaping parallel arrays is pretty intense: lots of your little chunks end up talking to lots of your other little chunks, creating havoc. This example is particularly bad.
Perhaps there is another way to produce your output in the correct shape initially?
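One workaround that sometimes helps (untested on this example, so treat it as an assumption): a reshape that only merges trailing axes stays cheap when each of those axes sits in a single chunk, so rechunking first may keep the graph manageable:
# Hypothetical workaround: put each trailing axis into one chunk so the
# reshape merges whole chunks. The rechunk itself costs communication,
# so this is a trade-off, not a free win.
f2 = f.rechunk({1: 300, 2: 300, 3: 300})
g = f2.reshape(300, 300**3)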
Looking through the development logs it appears that this failure was actually anticipated during development: https://github.com/dask/dask/pull/758
