Baseline fitting using NumPy poly1d - arrays

I have the following baseline:
As can be seen, it has an almost sinusoidal shape, and I am trying to use polyfit on it. What I actually have are two arrays of data, one called x and the other y. So what I am using is:
porder = 2
coefs = np.polyfit(x, y, porder)
baseline = np.poly1d(coefs)
cleanspec = y - baseline(x)
My goal is to obtain a clean spectrum in the end, one with a straight baseline and no undulation.
However, the fitting is not working. Any suggestions for another, more effective method?
I have tried changing porder to 3, but I get the following warning and the result doesn't change:
Polyfit may be poorly conditioned
My data for x:
[1.10192816e+11 1.10192893e+11 1.10192969e+11 1.10193045e+11
1.10193122e+11 1.10193198e+11 1.10193274e+11 1.10193350e+11
1.10193427e+11 1.10193503e+11 1.10193579e+11 1.10193656e+11
1.10193732e+11 1.10193808e+11 1.10193885e+11 1.10193961e+11
1.10194037e+11 1.10194113e+11 1.10194190e+11 1.10194266e+11
1.10194342e+11 1.10194419e+11 1.10194495e+11 1.10194571e+11
1.10194647e+11 1.10194724e+11 1.10194800e+11 1.10194876e+11
1.10194953e+11 1.10195029e+11 1.10195105e+11 1.10195182e+11
1.10195258e+11 1.10195334e+11 1.10195410e+11 1.10195487e+11
1.10195563e+11 1.10195639e+11 1.10195716e+11 1.10195792e+11
1.10195868e+11 1.10195944e+11 1.10196021e+11 1.10196097e+11
1.10196173e+11 1.10196250e+11 1.10196326e+11 1.10196402e+11
1.10196479e+11 1.10196555e+11 1.10196631e+11 1.10196707e+11
1.10196784e+11 1.10196860e+11 1.10196936e+11 1.10197013e+11
1.10197089e+11 1.10197165e+11 1.10197241e+11 1.10197318e+11
1.10197394e+11 1.10197470e+11 1.10197547e+11 1.10197623e+11
1.10197699e+11 1.10197776e+11 1.10197852e+11 1.10197928e+11
1.10198004e+11 1.10198081e+11 1.10198157e+11 1.10198233e+11
1.10198310e+11 1.10198386e+11 1.10198462e+11 1.10198538e+11
1.10198615e+11 1.10198691e+11 1.10198767e+11 1.10198844e+11
1.10198920e+11 1.10198996e+11 1.10199073e+11 1.10199149e+11
1.10199225e+11 1.10199301e+11 1.10199378e+11 1.10199454e+11
1.10199530e+11 1.10199607e+11 1.10199683e+11 1.10199759e+11
1.10199835e+11 1.10199912e+11 1.10199988e+11 1.10200064e+11
1.10200141e+11 1.10202582e+11 1.10202658e+11 1.10202735e+11
1.10202811e+11 1.10202887e+11 1.10202963e+11 1.10203040e+11
1.10203116e+11 1.10203192e+11 1.10203269e+11 1.10203345e+11
1.10203421e+11 1.10203498e+11 1.10203574e+11 1.10203650e+11
1.10203726e+11 1.10203803e+11 1.10203879e+11 1.10203955e+11
1.10204032e+11 1.10204108e+11 1.10204184e+11 1.10204260e+11
1.10204337e+11 1.10204413e+11 1.10204489e+11 1.10204566e+11
1.10204642e+11 1.10204718e+11 1.10204795e+11 1.10204871e+11
1.10204947e+11 1.10205023e+11 1.10205100e+11 1.10205176e+11
1.10205252e+11 1.10205329e+11 1.10205405e+11 1.10205481e+11
1.10205557e+11 1.10205634e+11 1.10205710e+11 1.10205786e+11
1.10205863e+11 1.10205939e+11 1.10206015e+11 1.10206092e+11
1.10206168e+11 1.10206244e+11 1.10206320e+11 1.10206397e+11
1.10206473e+11 1.10206549e+11 1.10206626e+11 1.10206702e+11
1.10206778e+11 1.10206854e+11 1.10206931e+11 1.10207007e+11
1.10207083e+11 1.10207160e+11 1.10207236e+11 1.10207312e+11
1.10207389e+11 1.10207465e+11 1.10207541e+11 1.10207617e+11
1.10207694e+11 1.10207770e+11 1.10207846e+11 1.10207923e+11
1.10207999e+11 1.10208075e+11 1.10208151e+11 1.10208228e+11
1.10208304e+11 1.10208380e+11 1.10208457e+11 1.10208533e+11
1.10208609e+11 1.10208686e+11 1.10208762e+11 1.10208838e+11
1.10208914e+11 1.10208991e+11 1.10209067e+11 1.10209143e+11
1.10209220e+11 1.10209296e+11 1.10209372e+11 1.10209448e+11
1.10209525e+11 1.10209601e+11 1.10209677e+11 1.10209754e+11
1.10209830e+11]
and for y:
[ 0.00143858 0.05495827 0.07481739 0.03287334 -0.06275658 0.03744501
-0.04392341 0.02849104 0.03173781 0.09748282 0.02854265 0.06573162
0.08215295 0.0240697 0.00931477 0.17572605 0.06783381 0.04853354
-0.00226023 0.03722596 0.09687121 0.10767829 0.04922701 0.08036865
0.02371989 0.13885361 0.13903188 0.09910567 0.08793601 0.06048823
0.03932097 0.04061129 0.03706228 0.13764936 0.14150589 0.12226208
0.09041878 0.13638676 0.11107155 0.12261369 0.11765545 0.07425344
0.06643712 0.1449991 0.14256909 0.0924173 0.09291525 0.12216271
0.11272059 0.07618891 0.16787807 0.07832849 0.10786856 0.12381844
0.14182937 0.08078092 0.11932429 0.06383649 0.02923562 0.0864741
0.07806758 0.04514088 0.12929371 0.11769577 0.03619867 0.02811366
0.06401639 0.06883735 0.01162673 0.0956252 0.11206549 0.0485106
0.07269545 0.01662149 0.01287365 0.13401546 0.06300487 0.01994627
0.00721926 0.04863274 -0.01578364 0.0235379 0.03102316 0.00392559
0.05662182 0.04643381 -0.00665026 0.05532307 -0.01533339 0.04838893
0.02097954 0.02551123 0.03727188 -0.04001189 -0.04294883 0.02837669
-0.06062512 -0.0743994 -0.04665618 -0.03553261 -0.07057554 -0.07028277
-0.07502298 -0.07247965 -0.03540266 -0.03226398 -0.08014487 -0.11907543
-0.18521053 -0.1117617 -0.14377897 -0.07113503 -0.02480966 -0.07459746
-0.07994097 -0.02648713 -0.10288478 -0.13328137 -0.08121377 -0.13742166
-0.024583 -0.11391389 -0.02717251 -0.08876166 -0.04369363 -0.0790144
-0.09589054 -0.12058701 0.00041344 -0.06646403 -0.06368366 -0.10335613
-0.04508286 -0.18360729 -0.0551775 -0.06476622 -0.0834523 -0.01276785
-0.04145486 -0.14549992 -0.11186823 -0.07663398 -0.11920359 -0.0539315
-0.10507118 -0.09112374 -0.09751319 -0.06848278 -0.09031172 -0.07218853
-0.03129234 -0.04543539 -0.00942861 -0.06711099 -0.00712202 -0.11696418
-0.06344093 0.03624227 -0.04798777 0.01174394 -0.08326314 -0.06761215
-0.12063419 -0.05236908 -0.03914692 -0.05370061 -0.01620056 0.06731788
-0.06600111 -0.04601257 -0.02144361 0.00256863 -0.00093034 0.00629604
-0.0252835 -0.00907992 0.03583489 -0.03761906 0.10325763 0.08016437
-0.04900467 0.0110328 0.05019604 -0.04428984 -0.03208058 0.05095359
-0.01807463 0.0691733 0.07472691 0.00659871 0.00947692 0.0014422
0.05227057]

Having this huge offset in x is probably not helping. It definitely works when removing it for the fitting process. It looks like this:
import matplotlib.pyplot as plt
import numpy as np

# rescale x to remove the huge offset (values are around 1.102e11)
scaledx = xdata * 1e-8 - 1100

# fit a 7th-order polynomial to the rescaled data
coefs = np.polyfit(scaledx, ydata, 7)
base = np.poly1d(coefs)

# smooth curve for plotting the fitted baseline
xt = np.linspace(1.9, 2.1, 150)
yt = base(xt)

fig = plt.figure()
ax = fig.add_subplot(2, 1, 1)
bx = fig.add_subplot(2, 1, 2)
ax.scatter(scaledx, ydata)
ax.plot(xt, yt)
bx.plot(scaledx, ydata - base(scaledx))  # baseline-subtracted spectrum
plt.show()
with xdata and ydata being NumPy arrays of the OP's data lists. This produces:
Addendum
Concerning the "poorly conditioned" warning, one should remember how simple linear least-squares optimization works. In the case of a polynomial one builds the matrix:
A = [
[1, x1, x1**2, ...],
[1, x2, x2**2, ...],
...
[1, xn, xn**2, ...]
]
and one needs B^(-1), the inverse of B, with B = A^T·A and A^T being the transpose of A. Now, with x values on the order of 1e11, B will have entries of order 1 on one side of the diagonal and, for a second-order polynomial, of order 1e44 on the other. For a third-order polynomial this gets correspondingly worse, so computing the inverse becomes numerically unstable. Luckily, as used above, this can be solved easily by simply rescaling the problem at hand.
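As an alternative to rescaling by hand, here is a minimal sketch (assuming x and y are the arrays posted above) using NumPy's newer polynomial API; Polynomial.fit maps x onto a normalized window internally, which sidesteps the conditioning problem:
import numpy as np
from numpy.polynomial import Polynomial

# Polynomial.fit rescales x into [-1, 1] internally before fitting
p = Polynomial.fit(x, y, deg=7)
baseline = p(x)            # evaluate the baseline at the original x values
cleanspec = y - baseline   # baseline-subtracted spectrum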

Related

Modelling and fitting bi-modal lognormal distributions in a loop using lmfit

I have been spending FAR too much time trying to figure this out - so time to seek help. I am attempting to use lmfit to fit two lognormals (a and c), as well as the sum of these two lognormals (a+c), to a size distribution. Mode a centers around x=0.2, y=1; mode c centers around x=1.2, y <<< 1. There are numerous size distributions (>200), all slightly different, which are passed into the following code from an outside loop. For this example I have provided a real-life distribution and have not included the loop. Hopefully my code is sufficiently annotated to allow understanding of what I am trying to achieve.
I must be missing some fundamental understanding of lmfit (spoiler alert - I'm not great at Maths either) as I have two problems:
1. The fits (a, c and a+c) do not accurately represent the data. Note how the fit (red solid line) diverges from the data (blue solid line). I assume this is something to do with the initial guess parameters. I have tried LOTS and have been unable to get a good fit.
2. Re-running the model with the "new" best-fit values (results2, results3) doesn't appear to significantly improve the fit at all. Why?
Example result using provided x and y data:
Here is one I made earlier showing the type of fit I am after (produced using the older mpfit module, with different data than provided below and unique initial best-guess parameters, not in a loop). Excuse the legend format; I had to remove certain information:
Any assistance is much appreciated. Here is the code with an example distribution:
from lmfit import models
import matplotlib.pyplot as plt
import numpy as np
# real life data example
y = np.array([1.000000, 0.754712, 0.610303, 0.527856, 0.412125, 0.329689, 0.255756, 0.184424, 0.136819,
0.102316, 0.078763, 0.060896, 0.047118, 0.020297, 0.007714, 0.010202, 0.008710, 0.005579,
0.004644, 0.004043, 0.002618, 0.001194, 0.001263, 0.001043, 0.000584, 0.000330, 0.000179,
0.000117, 0.000050, 0.000035, 0.000017, 0.000007])
x = np.array([0.124980, 0.130042, 0.135712, 0.141490, 0.147659, 0.154711, 0.162421, 0.170855, 0.180262,
0.191324, 0.203064, 0.215738, 0.232411, 0.261810, 0.320252, 0.360761, 0.448802, 0.482528,
0.525526, 0.581518, 0.658988, 0.870114, 1.001815, 1.238899, 1.341285, 1.535134, 1.691963,
1.973359, 2.285620, 2.572177, 2.900414, 3.342739])
# create the joint model using prefixes for each mode
model = (models.LognormalModel(prefix='p1_') +
models.LognormalModel(prefix='p2_'))
# add some best guesses for the model parameters
params = model.make_params(p1_center=0.1, p1_sigma=2, p1_amplitude=1,
p2_center=1, p2_sigma=2, p2_amplitude=0.000000000000001)
# bound those best guesses
# params['p1_amplitude'].min = 0.0
# params['p1_amplitude'].max = 1e5
# params['p1_sigma'].min = 1.01
# params['p1_sigma'].max = 5
# params['p1_center'].min = 0.01
# params['p1_center'].max = 1.0
#
# params['p2_amplitude'].min = 0.0
# params['p2_amplitude'].max = 1
# params['p2_sigma'].min = 1.01
# params['p2_sigma'].max = 10
# params['p2_center'].min = 1.0
# params['p2_center'].max = 3
# actually fit the model
result = model.fit(y, params, x=x)
# ====================================
# ================================
# re-run using the best-fit params derived above
params2 = model.make_params(p1_center=result.best_values['p1_center'], p1_sigma=result.best_values['p1_sigma'],
p1_amplitude=result.best_values['p1_amplitude'],
p2_center=result.best_values['p2_center'], p2_sigma=result.best_values['p2_sigma'],
p2_amplitude=result.best_values['p2_amplitude'], )
# re-fit the model
result2 = model.fit(y, params2, x=x)
# ================================
# re-run again using the best-fit params derived above
params3 = model.make_params(p1_center=result2.best_values['p1_center'], p1_sigma=result2.best_values['p1_sigma'],
p1_amplitude=result2.best_values['p1_amplitude'],
p2_center=result2.best_values['p2_center'], p2_sigma=result2.best_values['p2_sigma'],
p2_amplitude=result2.best_values['p2_amplitude'], )
# re-fit the model
result3 = model.fit(y, params3, x=x)
# ================================
# add individual fine and coarse modes using the revised fit parameters
model_a = models.LognormalModel()
params_a = model_a.make_params(center=result3.best_values['p1_center'], sigma=result3.best_values['p1_sigma'],
amplitude=result3.best_values['p1_amplitude'])
result_a = model_a.fit(y, params_a, x=x)
model_c = models.LognormalModel()
params_c = model_c.make_params(center=result3.best_values['p2_center'], sigma=result3.best_values['p2_sigma'],
amplitude=result3.best_values['p2_amplitude'])
result_c = model_c.fit(y, params_c, x=x)
# ====================================
plt.plot(x, y, 'b-', label='data')
plt.plot(x, result.best_fit, 'r-', label='best_fit_1')
plt.plot(x, result.init_fit, 'lightgrey', ls=':', label='ini_fit_1')
plt.plot(x, result2.best_fit, 'r--', label='best_fit_2')
plt.plot(x, result2.init_fit, 'lightgrey', ls='--', label='ini_fit_2')
plt.plot(x, result3.best_fit, 'r.-', label='best_fit_3')
plt.plot(x, result3.init_fit, 'lightgrey', ls='--', label='ini_fit_3')
plt.plot(x, result_a.best_fit, 'grey', ls=':', label='best_fit_a')
plt.plot(x, result_c.best_fit, 'grey', ls='--', label='best_fit_c')
plt.xscale("log")
plt.yscale("log")
plt.legend()
plt.show()
There are three main pieces of advice I can give:
1. Initial values matter and should not be so far off as to make portions of the model completely insensitive to the parameter values. Your initial model is sort of off by several orders of magnitude.
2. Always look at the fit result. This is the primary result -- the plot of the fit is a representation of the actual numerical results. Not showing that you printed out the fit report is a good indication that you did not look at the actual result. Really, always look at the results.
3. If you are judging the quality of the fit based on a plot of the data and model, use how you choose to plot the data to guide how you fit the data. Specifically in your case, if you are plotting on a log scale, then fit the log of the data to the log of the model: fit in "log space".
Such a fit might look like this:
from lmfit import models, Model
from lmfit.lineshapes import lognormal
import matplotlib.pyplot as plt
import numpy as np
y = np.array([1.000000, 0.754712, 0.610303, 0.527856, 0.412125, 0.329689, 0.255756, 0.184424, 0.136819,
0.102316, 0.078763, 0.060896, 0.047118, 0.020297, 0.007714, 0.010202, 0.008710, 0.005579,
0.004644, 0.004043, 0.002618, 0.001194, 0.001263, 0.001043, 0.000584, 0.000330, 0.000179,
0.000117, 0.000050, 0.000035, 0.000017, 0.000007])
x = np.array([0.124980, 0.130042, 0.135712, 0.141490, 0.147659, 0.154711, 0.162421, 0.170855, 0.180262,
0.191324, 0.203064, 0.215738, 0.232411, 0.261810, 0.320252, 0.360761, 0.448802, 0.482528,
0.525526, 0.581518, 0.658988, 0.870114, 1.001815, 1.238899, 1.341285, 1.535134, 1.691963,
1.973359, 2.285620, 2.572177, 2.900414, 3.342739])
# use a model that is the log of the sum of two log-normal functions
# note to be careful about log(x) for x < 0.
def log_lognormal(x, amp1, cen1, sig1, amp2, cen2, sig2):
    comp1 = lognormal(x, amp1, cen1, sig1)
    comp2 = lognormal(x, amp2, cen2, sig2)
    total = comp1 + comp2
    total[np.where(total < 1.e-99)] = 1.e-99  # clip to keep log() well-defined
    return np.log(total)
model = Model(log_lognormal)
params = model.make_params(amp1=5.0, cen1=-4, sig1=1,
amp2=0.1, cen2=-1, sig2=1)
# part of making sure that the lognormals are strictly positive
params['amp1'].min = 0
params['amp2'].min = 0
result = model.fit(np.log(y), params, x=x)
print(result.fit_report()) # <-- HERE IS WHERE THE RESULTS ARE!!
# also, make a plot of data and fit
plt.plot(x, y, 'b-', label='data')
plt.plot(x, np.exp(result.best_fit), 'r-', label='best_fit')
plt.plot(x, np.exp(result.init_fit), 'grey', label='ini_fit')
plt.xscale("log")
plt.yscale("log")
plt.legend()
plt.show()
This will print out
[[Model]]
    Model(log_lognormal)
[[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 211
    # data points      = 32
    # variables        = 6
    chi-square         = 0.91190970
    reduced chi-square = 0.03507345
    Akaike info crit   = -101.854407
    Bayesian info crit = -93.0599914
[[Variables]]
    amp1:  21.3565856 +/- 193.951379 (908.16%) (init = 5)
    cen1: -4.40637490 +/- 3.81299642 (86.53%) (init = -4)
    sig1:  0.77286862 +/- 0.55925566 (72.36%) (init = 1)
    amp2:  0.00401804 +/- 7.5833e-04 (18.87%) (init = 0.1)
    cen2: -0.74055538 +/- 0.13043827 (17.61%) (init = -1)
    sig2:  0.64346873 +/- 0.04102122 (6.38%) (init = 1)
[[Correlations]] (unreported correlations are < 0.100)
    C(amp1, cen1) = -0.999
    C(cen1, sig1) = -0.999
    C(amp1, sig1) =  0.997
    C(cen2, sig2) = -0.964
    C(amp2, cen2) = -0.940
    C(amp2, sig2) =  0.849
    C(sig1, amp2) = -0.758
    C(cen1, amp2) =  0.740
    C(amp1, amp2) = -0.726
    C(sig1, cen2) =  0.687
    C(cen1, cen2) = -0.669
    C(amp1, cen2) =  0.655
    C(sig1, sig2) = -0.598
    C(cen1, sig2) =  0.581
    C(amp1, sig2) = -0.567
and generate a plot like

"Data arrays must have the same length, and match time discretization in dynamic problems" error in GEKKO

I want to find the value of the parameter m that minimizes my variable x subject to a system of differential equations. I have the following code
from gekko import GEKKO
import numpy as np

def run_model_m(days, population, case, k_val, b_val, u0_val, sigma_val, Kmax0, a_val, c_val):
    list_x = []
    list_u = []
    list_Kmax = []
    for i in range(len(days)):
        list_xi = []
        list_ui = []
        list_Ki = []
        for j in range(len(days[i])):
            #try:
            m = GEKKO(remote=False)
            #m.time = days[i][j]
            eval = np.linspace(days[i][j][0], days[i][j][-1], 100, endpoint=True)
            m.time = eval
            x_data = population[i][j]
            variable = np.linspace(population[i][j][0], population[i][j][-1], 100, endpoint=True)
            x = m.Var(value=population[i][j][0], lb=0)
            sigma = m.Param(sigma_val)
            d = m.Param(c_val)
            k = m.Param(k_val)
            b = m.Param(b_val)
            r = m.Param(a_val)
            step = np.ones(len(eval))
            step = 0.2*step
            step[0] = 1
            m_param = m.CV(value=1, lb=0, ub=1, integer=True); m_param.STATUS = 1
            u = m.Var(value=u0_val, lb=0, ub=1)
            #m.free(u)
            a = m.Param(a_val)
            c = m.Param(c_val)
            Kmax = m.Param(Kmax0)
            if case == 'case0':
                m.Equations([x.dt() == x*(r*(1-x/(Kmax))-m_param/(k+b*u)-d),
                             u.dt() == sigma*(m_param*b/((k+b*u)**2))])
            elif case == 'case4':
                m.Equations([x.dt() == x*(r*(1-u**2)*(1-x/(Kmax))-m_param/(k+b*u)-d),
                             u.dt() == sigma*(-2*u*r*(1-x/(Kmax))+(b*m_param)/(b*u+k)**2)])
            p = np.zeros(len(eval))
            p[-1] = 1.0
            final = m.Param(value=p)
            m.Obj(x)
            m.options.IMODE = 6
            m.options.MAX_ITER = 15000
            m.options.SOLVER = 1
            # optimize
            m.solve(disp=False, GUI=False)
            #m.open_folder(dataset_path+'inf')
            list_xi.append(x.value)
            list_ui.append(u.value)
            list_Ki.append(m_param.value)
        list_x.append(list_xi)
        list_Kmax.append(list_Ki)
        list_u.append(list_ui)
    return list_x, list_u, list_Kmax, m.options.OBJFCNVAL
scaled_days[i][j] =[-7.0, 42.0, 83.0, 125.0, 167.0, 217.0, 258.0, 300.0, 342.0]
scaled_pop[i][j] = [0.01762491277346285, 0.020592540360308997, 0.017870838266697213, 0.01690069378982034,0.015512320147187675,0.01506701796298272,0.014096420738841563,0.013991224004743027,0.010543380664478205]
k0,b0,group, case0, u0, sigma0, K0, a0, c0 = (100, 20, 'Size3, Inc', 'case0', 0.1, 0.05, 2, 0, 0.01)
list_x2, list_u2, list_Kmax2, final = run_model_m(
    days=[[scaled_days[i][j]]], population=[[scaled_pop[i][j]]], case=case1,
    k_val=list_b1[i0][0], b_val=b1, u0_val=list_u1[i0][j0],
    sigma_val=sigma1, Kmax0=K1, a_val=list_Kmax1[0][0], c_val=c1)
I get the error Data arrays must have the same length, and match time discretization in dynamic problems, but I don't understand why. I have tried making x and m_param arrays, with x=m.Var, m_param=m.MV..., but I still get the same error, even when they are all arrays of the same length. Is this the right way to set up the minimization problem?
I think the error was just that in run_model_m I was passing a list as u0_val, and it didn't have the same dimensions as m.time. So it should be u0_val=list_u1[0][0][0].
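For reference, a minimal sketch (variable names are illustrative, not taken from the model above) of the rule that triggers this error in GEKKO: in dynamic modes any array passed as a value must have the same length as m.time, whereas a scalar is broadcast over all time points:
from gekko import GEKKO
import numpy as np

m = GEKKO(remote=False)
m.time = np.linspace(0, 1, 100)

u = m.Var(value=0.1, lb=0, ub=1)   # scalar initial value: fine

p = np.zeros(len(m.time))          # array value: length must equal len(m.time)
p[-1] = 1.0
final = m.Param(value=p)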

Increase creation speed for a masked xarray file

I am currently trying to crop a rectangular xarray file to the shape of a country using a mask grid. Below you can find my current solution (with simpler and smaller arrays). The code works and I get the desired mask based on 1s and 0s. The problem lies in the fact that, when run on a real country shape (larger and more complex), the code takes over 30 minutes to run. Since I am using very basic operations here, like nested for loops, I also tried alternatives such as a list-based approach; however, when timing the process, it did not improve on the code below. I wonder if there is a faster way to obtain this mask (vectorization?) or if I should approach the problem in a different way (I tried exploring xarray's properties, but have not found anything that tackles this issue yet).
Code below:
import geopandas as gpd
from shapely.geometry import Polygon, Point
import pandas as pd
import numpy as np
import xarray as xr
df = pd.read_csv('Brazil_borders.csv',index_col=0)
lats = np.array([-20, -5, -5, -20,])
lons = np.array([-60, -60, -30, -30])
lats2 = np.array([-10.25, -10.75, -11.25, -11.75, -12.25, -12.75, -13.25, -13.75,
-14.25, -14.75, -15.25, -15.75, -16.25, -16.75, -17.25, -17.75,
-18.25, -18.75, -19.25, -19.75, -20.25, -20.75, -21.25, -21.75,
-22.25, -22.75, -23.25, -23.75, -24.25, -24.75, -25.25, -25.75,
-26.25, -26.75, -27.25, -27.75, -28.25, -28.75, -29.25, -29.75,
-30.25, -30.75, -31.25, -31.75, -32.25, -32.75])
lons2 = np.array([-61.75, -61.25, -60.75, -60.25, -59.75, -59.25, -58.75, -58.25,
-57.75, -57.25, -56.75, -56.25, -55.75, -55.25, -54.75, -54.25,
-53.75, -53.25, -52.75, -52.25, -51.75, -51.25, -50.75, -50.25,
-49.75, -49.25, -48.75, -48.25, -47.75, -47.25, -46.75, -46.25,
-45.75, -45.25, -44.75, -44.25])
points = []
for i in range(len(lats)):
    _ = [lats[i], lons[i]]
    points.append(_)
poly_proj = Polygon(points)
mask = np.zeros((len(lats2), len(lons2)))  # Mask with the dataset's shape and size.
for i in range(len(lats2)):  # Iteration to verify if a given coordinate is within the polygon's area
    for j in range(len(lons2)):
        grid_point = Point(lats2[i], lons2[j])
        if grid_point.within(poly_proj):
            mask[i][j] = 1
bool_final = mask
bool_final
The alternative based on a list-comprehension approach, which had even worse processing time (according to timeit):
lats = np.array([-20, -5, -5, -20,])
lons = np.array([-60, -60, -30, -30])
lats2 = np.array([-10.25, -10.75, -11.25, -11.75, -12.25, -12.75, -13.25, -13.75,
-14.25, -14.75, -15.25, -15.75, -16.25, -16.75, -17.25, -17.75,
-18.25, -18.75, -19.25, -19.75, -20.25, -20.75, -21.25, -21.75,
-22.25, -22.75, -23.25, -23.75, -24.25, -24.75, -25.25, -25.75,
-26.25, -26.75, -27.25, -27.75, -28.25, -28.75, -29.25, -29.75,
-30.25, -30.75, -31.25, -31.75, -32.25, -32.75])
lons2 = np.array([-61.75, -61.25, -60.75, -60.25, -59.75, -59.25, -58.75, -58.25,
-57.75, -57.25, -56.75, -56.25, -55.75, -55.25, -54.75, -54.25,
-53.75, -53.25, -52.75, -52.25, -51.75, -51.25, -50.75, -50.25,
-49.75, -49.25, -48.75, -48.25, -47.75, -47.25, -46.75, -46.25,
-45.75, -45.25, -44.75, -44.25])
points = []
for i in range(len(lats)):
    _ = [lats[i], lons[i]]
    points.append(_)
poly_proj = Polygon(points)
grid_point = [Point(lats2[i],lons2[j]) for i in range(len(lats2)) for j in range(len(lons2))]
mask = [1 if grid_point[i].within(poly_proj) else 0 for i in range(len(grid_point))]
bool_final2 = np.reshape(mask,(((len(lats2)),(len(lons2)))))
Thank you in advance!
Based on this answer from snowman2, I created this simple function that provides a much faster solution using geopandas and rioxarray. Instead of a list of latitudes and longitudes, one has to use a shapefile with the desired shape to be masked (see the instructions for GeoDataFrame creation from a list of coordinates).
import xarray as xr
import geopandas as gpd
import rioxarray
from shapely.geometry import mapping
def mask_shape_border(DS, shape_shp):  # Inputs: the dataset to be cropped and the path of the mask file (.shp)
    if 'lat' in DS:  # Some datasets use lat/lon, others latitude/longitude
        DS.rio.set_spatial_dims(x_dim="lon", y_dim="lat", inplace=True)
    elif 'latitude' in DS:
        DS.rio.set_spatial_dims(x_dim="longitude", y_dim="latitude", inplace=True)
    else:
        print("Error: check latitude and longitude variable names.")
    DS.rio.write_crs("epsg:4326", inplace=True)
    mask = gpd.read_file(shape_shp, crs="epsg:4326")
    DS_clipped = DS.rio.clip(mask.geometry.apply(mapping), mask.crs, drop=False)
    return DS_clipped
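A hypothetical usage example (the file names are placeholders, not from the original question):
ds = xr.open_dataset("precipitation.nc")               # placeholder NetCDF dataset
clipped = mask_shape_border(ds, "brazil_borders.shp")  # placeholder shapefile
clipped.to_netcdf("precipitation_brazil.nc")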

TensorFlow Probability Logistic Regression Example

I feel I must be missing something obvious in struggling to get a positive control for logistic regression going in TensorFlow Probability.
I've modified the example for logistic regression here and created positive-control features and labels data. I struggle to achieve accuracy over 60%, yet this is an easy problem for a 'vanilla' Keras model (100% accuracy). What am I missing? I tried different layers, activations, etc. With this way of setting up the model, is posterior updating actually being performed? Do I need to specify an interceptor object? Many thanks.
### Added positive control
nSamples = 80
features1 = np.float32(np.hstack((np.reshape(np.ones(40), (40, 1)),
                                  np.reshape(np.random.randn(nSamples), (40, 2)))))
features2 = np.float32(np.hstack((np.reshape(np.zeros(40), (40, 1)),
                                  np.reshape(np.random.randn(nSamples), (40, 2)))))
features = np.vstack((features1, features2))
labels = np.concatenate((np.zeros(40), np.ones(40)))
featuresInt, labelsInt = build_input_pipeline(features, labels, 10)
###
# w_true, b_true, features, labels = toy_logistic_data(FLAGS.num_examples, 2)
# featuresInt, labelsInt = build_input_pipeline(features, labels, FLAGS.batch_size)

with tf.name_scope("logistic_regression", values=[featuresInt]):
    layer = tfp.layers.DenseFlipout(
        units=1,
        activation=None,
        kernel_posterior_fn=tfp.layers.default_mean_field_normal_fn(),
        bias_posterior_fn=tfp.layers.default_mean_field_normal_fn())
    logits = layer(featuresInt)
    labels_distribution = tfd.Bernoulli(logits=logits)

neg_log_likelihood = -tf.reduce_mean(labels_distribution.log_prob(labelsInt))
kl = sum(layer.losses)
elbo_loss = neg_log_likelihood + kl

predictions = tf.cast(logits > 0, dtype=tf.int32)
accuracy, accuracy_update_op = tf.metrics.accuracy(
    labels=labelsInt, predictions=predictions)

with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate)
    train_op = optimizer.minimize(elbo_loss)

init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())

with tf.Session() as sess:
    sess.run(init_op)
    # Fit the model to data.
    for step in range(FLAGS.max_steps):
        _ = sess.run([train_op, accuracy_update_op])
        if step % 100 == 0:
            loss_value, accuracy_value = sess.run([elbo_loss, accuracy])
            print("Step: {:>3d} Loss: {:.3f} Accuracy: {:.3f}".format(
                step, loss_value, accuracy_value))

### Check with basic Keras
kerasModel = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1)])
optimizer = tf.train.AdamOptimizer(5e-2)
kerasModel.compile(optimizer=optimizer, loss='binary_crossentropy',
                   metrics=['accuracy'])
kerasModel.fit(features, labels, epochs=50)  # 100% accuracy
Compared to the GitHub example, you forgot to divide by the number of examples when defining the KL divergence:
kl = sum(layer.losses) / FLAGS.num_examples
When I make this change to your code, I quickly get to an accuracy of 99.9% on your toy data.
Additionally, the output layer of your Keras model should use a sigmoid activation for this problem (binary classification):
kerasModel = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, activation='sigmoid')])
It's a toy problem, but you will notice that the model gets to 100% accuracy faster with a sigmoid activation.

Using lookup tables to plot a ggplot and table

I'm creating a Shiny app and I'm letting the user choose what data should be displayed in a plot and a table. This choice is made through three different input variables that contain 14, 4 and 2 choices respectively.
ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(
    selectInput(inputId = "DataSource", label = "Data source",
                choices = c("Restoration plots", "all semi natural grasslands")),
    selectInput(inputId = "Variabel", label = "Variable",
                choices = choicesVariables),
    # choicesVariables definition is omitted here because it's very long, but it
    # contains 14 string values
    selectInput(inputId = "Factor", label = "Factor",
                choices = c("Company type", "Region and type of application",
                            "Approved or not approved applications", "Age group"))
  ),
  dashboardBody(
    plotOutput("thePlot"),
    tableOutput("theTable")
  )
)
This adds up to 73 choices (yes, I know the math doesn't add up there, but some choices are invalid). I would like to do this using a lookup table, so I created one with every valid combination of choices, like this:
rad1<-c(rep("Company type",20), rep("Region and type of application",20),
rep("Approved or not approved applications", 13), rep("Age group", 20))
rad2<-choicesVariable[c(1:14,1,4,5,9,10,11, 1:14,1,4,5,9,10,11, 1:7,9:14,
1:14,1,4,5,9,10,11)]
rad3<-c(rep("Restoration plots",14),rep("all semi natural grasslands",6),
rep("Restoration plots",14), rep("all semi natural grasslands",6),
rep("Restoration plots",27), rep("all semi natural grasslands",6))
rad4<-1:73
letaLista<-data.frame(rad1,rad2,rad3, rad4)
colnames(letaLista) <- c("Factor", "Variabel", "rest_alla", "id")
Now it's easy to use subset to get only the choice that the user made. But how do I use this information to produce the plot and table without a 73-line-long ifelse statement?
I tried to create some sort of multidimensional array that could hold all the tables (and one for the plots), but I couldn't make it work. My experience with this kind of array is limited and this might be a simple issue, but any hints would be helpful!
The dataset that is the foundation for the plots and tables is a data frame with 23 variables, both factors and numerical. The plots and tables are then created, for all 73 combinations, using the following code:
s_A1 <- summarySE(Samlad_info, measurevar="Dist_brukcentrum",
                  groupvars="Companytype")
s_A1 <- s_A1[2:6,]
p_A1 <- ggplot(s_A1, aes(x=Companytype, y=Dist_brukcentrum)) +
  geom_bar(position=position_dodge(), stat="identity") +
  geom_errorbar(aes(ymin=Dist_brukcentrum-se, ymax=Dist_brukcentrum+se),
                width=.2, position=position_dodge(.9)) +
  scale_y_continuous(name = "") + scale_x_discrete(name = "")
where summarySE is the following function, borrowed from Cookbook for R:
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=TRUE,
                      conf.interval=.95, .drop=TRUE) {
  # New version of length which can handle NA's: if na.rm==T, don't count them
  length2 <- function(x, na.rm=FALSE) {
    if (na.rm) sum(!is.na(x))
    else length(x)
  }
  # This does the summary. For each group's data frame, return a vector with
  # N, mean, and sd
  datac <- ddply(data, groupvars, .drop=.drop,
                 .fun = function(xx, col) {
                   c(N    = length2(xx[[col]], na.rm=na.rm),
                     mean = mean(xx[[col]], na.rm=na.rm),
                     sd   = sd(xx[[col]], na.rm=na.rm))
                 },
                 measurevar)
  # Rename the "mean" column
  datac <- rename(datac, c("mean" = measurevar))
  datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
  # Confidence interval multiplier for standard error
  # Calculate t-statistic for confidence interval:
  # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
  ciMult <- qt(conf.interval/2 + .5, datac$N-1)
  datac$ci <- datac$se * ciMult
  return(datac)
}
The code in its entirety is a bit too large to include, but I hope this clarifies what I'm trying to do.
Well, thanks to Florian's comment I think I might have found a solution myself. I'll present it here but leave the question open, as there are probably far neater ways of doing it.
I gathered the plots (which ggplot creates as list objects) into a list, and the summary tables into another:
plotList <- list(p_A1, p_A2, p_A3...)
tableList <- list(s_A1, s_A2, s_A3...)
I then used subset on my lookup table to get the matching id, and used it to select the right plot and table from the lists.
output$thePlot <- renderPlot({
  plotValue <- subset(letaLista, letaLista$Factor==input$Factor &
                        letaLista$Variabel==input$Variabel &
                        letaLista$rest_alla==input$DataSource)
  plotList[as.integer(plotValue[1,4])]
})
output$theTable <- renderTable({
  plotValue <- subset(letaLista, letaLista$Factor==input$Factor &
                        letaLista$Variabel==input$Variabel &
                        letaLista$rest_alla==input$DataSource)
  skriva <- tableList[as.integer(plotValue[4])]
  print(skriva)
})
