GridSearchCV with ALS algorithm

I am working in PySpark and tried to use scikit-learn's GridSearchCV with the ALS algorithm, but I got an error. Any help? Thank you.
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
param_grid = {'rank': [10, 50, 100, 150],
              'regParam': [0.01, 0.05, 0.1, 0.15]}
# run grid search
grid_search = GridSearchCV(als, param_grid=param_grid, scoring='accuracy')
start = time()
model_gridSeach = grid_search.fit(features_train, lable_train)
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))
report(grid_search.cv_results_)
Output:
Cannot clone object 'ALS_855af664ffc8' (type <class 'pyspark.ml.recommendation.ALS'>):
it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.

As your output suggests, the als object you created is an instance of pyspark.ml.recommendation.ALS, which is not a scikit-learn estimator, so it cannot be used with sklearn.model_selection.GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.
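One alternative (not part of the answer above, just a sketch) is to stay entirely in Spark and use PySpark's own tuning utilities, ParamGridBuilder and CrossValidator, in place of scikit-learn's grid search. Here ratings_df stands in for your Spark DataFrame of userId, movieId and rating columns:

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

als = ALS(maxIter=5, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")

param_grid = (ParamGridBuilder()
              .addGrid(als.rank, [10, 50, 100, 150])
              .addGrid(als.regParam, [0.01, 0.05, 0.1, 0.15])
              .build())

# RMSE on the held-out folds; 'accuracy' is a classification metric and does not
# apply to predicting ratings
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")

cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid,
                    evaluator=evaluator, numFolds=3)

cv_model = cv.fit(ratings_df)  # ratings_df: placeholder for your ratings DataFrame
best_als = cv_model.bestModel

CrossValidator plays the same role as GridSearchCV but works on Spark DataFrames and Spark ML estimators, so the clone error goes away.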

Related

Lasso via GridSearchCV: ConvergenceWarning: Objective did not converge

I am trying to find the optimal parameter of a Lasso regression:
alpha_tune = {'alpha': np.linspace(start=0.000005, stop=0.02, num=200)}
model_tuner = Lasso(fit_intercept=True)
cross_validation = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
model = GridSearchCV(estimator=model_tuner, param_grid=alpha_tune, cv=cross_validation, scoring='neg_mean_squared_error', n_jobs=-1).fit(features_train_std, labels_train)
print(model.best_params_['alpha'])
My variables are demeaned and standardized, but I get the following warning:
ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.279e+02, tolerance: 6.395e-01
I know this warning has been reported several times, but none of the previous posts explain how to solve it. In my case, it is triggered by the fact that the lower bound 0.000005 is very small; this is nevertheless a reasonable value, as indicated by solving the tuning problem via the information criteria:
lasso_aic = LassoLarsIC(criterion='aic', fit_intercept=True, eps=1e-16, normalize=False)
lasso_aic.fit(X_train_std, y_train)
print('Lambda: {:.8f}'.format(lasso_aic.alpha_))
lasso_bic = LassoLarsIC(criterion='bic', fit_intercept=True, eps=1e-16, normalize=False)
lasso_bic.fit(X_train_std, y_train)
print('Lambda: {:.8f}'.format(lasso_bic.alpha_))
AIC and BIC give values of around 0.000008. How can this warning be solved?
Increasing max_iter (default 1000) in Lasso will do the job:
alpha_tune = {'alpha': np.linspace(start=0.000005, stop=0.02, num=200)}
model_tuner = Lasso(fit_intercept=True, max_iter=5000)
cross_validation = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
model = GridSearchCV(estimator=model_tuner, param_grid=alpha_tune, cv=cross_validation, scoring='neg_mean_squared_error', n_jobs=-1).fit(features_train_std, labels_train)
print(model.best_params_['alpha'])

CellVariable * DiffusionTerm in FiPy

I am trying to solve the following coupled PDEs in FiPy. I tried this:
eq1 = (DiffusionTerm(coeff=1, var=f) - f * DiffusionTerm(coeff=1, var=phi)
       + f - f**3 == 0)
eq2 = (2 * DiffusionTerm(coeff=f, var=phi) + f * DiffusionTerm(coeff=1, var=phi)
       == 0)
eq = eq1 & eq2
eq.solve()
but it does not like f*DiffusionTerm(coeff=1, var=phi), and I get the error "TermMultiplyError: Must multiply terms by int or float." Is there a way to implement a cell variable times a diffusion term?
Neither of the following will work in FiPy,
from fipy import CellVariable, DiffusionTerm, Grid1D, ImplicitSourceTerm
mesh = Grid1D(nx=10)
var = CellVariable(mesh=mesh)
# eqn = var * DiffusionTerm(coeff=1)
eqn = ImplicitSourceTerm(coeff=DiffusionTerm(coeff=1))
eqn.solve(var)
They both simply don't make sense with respect to the discretization in the finite volume method. Regardless, you can use the identity

∇·(f ∇f) = f ∇²f + |∇f|²

to rewrite the term of interest. Basically, instead of using

var * DiffusionTerm(coeff=1)

you can use

DiffusionTerm(coeff=var) - var.grad.mag**2

giving a regular diffusion term plus an extra explicit source term.
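Put together with the toy setup above, a minimal sketch of the rewritten equation (boundary conditions and the original coupled system are left out; the uniform initial value just keeps the diffusion coefficient from being identically zero):

from fipy import CellVariable, DiffusionTerm, Grid1D

mesh = Grid1D(nx=10)
var = CellVariable(mesh=mesh, value=1.0)

# var * laplacian(var), rewritten as div(var * grad(var)) minus |grad(var)|**2
eqn = (DiffusionTerm(coeff=var) - var.grad.mag**2 == 0)
eqn.solve(var=var)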

Titanic Dataset. Logistic Regression Model. Confusion Matrix gives 0 as output

I am running the logistic regression model on the Titanic Dataset with the below code:
#Modeling
#Split into train and test and fit the logistic regression model
titanic_train <- titanic_complete[1:891,]
titanic_test <- titanic_complete[892:1309,]
##############Logistic Regression ###############################
glm_model = glm(Survived~.,data= titanic_train, family = 'binomial')
summary(glm_model)
## Using anova() to analyze the table of deviance
anova(glm_model, test="Chisq")
final_model = glm(Survived~Sex + Pclass + Age + SibSp + Cabin_f, data = titanic_train, family = 'binomial')
summary(final_model)
varImp(glm_model)
glm.pred <-predict(final_model, titanic_test, type = 'response')
glm.pred <- ifelse(glm.pred > 0.5, "yes", "no")
glm.pred
confusionMatrix(glm.pred, titanic_test$Survived)
As a result I've got this error message:
> confusionMatrix(glm.pred, titanic_test$Survived)
[1] no yes
<0 rows> (or 0-length row.names)
In Ops.factor(predictedScores, threshold) :
‘<’ not meaningful for factors
I cannot understand this error message or what is wrong. The model works fine on the training data.
I assume it is related to the threshold I apply to the Survived variable (which is a factor variable: 1, 0).

GAE ndb.query.filter() not working

I have a Story model that inherits from ndb.Model, with an IntegerProperty wordCount. I'm trying to query Story objects that have a specific word count range, but the query seems to return the same results, regardless of the filter properties.
For this code:
q = Story.query()
q.filter(Story.wordCount > 900)
for s in q.fetch(5):
    print s.title, '/', s.wordCount
I get this result:
If only ... / 884
Timed release / 953
Grandfather paradox / 822
Harnessing the brane-deer / 1618
Quantum erat demonstrandum / 908
Here's the story declaration:
class Story(ndb.Model):
    title = ndb.StringProperty(required=True)
    wordCount = ndb.IntegerProperty('wc')
I would expect to only get stories that have more than 900 words--or none. Inequality filters and sorting are also broken. I tried deploying to GAE, and I'm seeing the same broken results.
Any ideas on what would be causing this?
NDB queries are immutable: when you call q.filter(Story.wordCount > 900) you create a new query without assigning it to anything. Re-assigning it to your q variable should work for you:
q = Story.query()
q = q.filter(Story.wordCount > 900)
for s in q.fetch(5):
    print s.title, '/', s.wordCount
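Equivalently, ndb lets you pass the filter directly to query(), which sidesteps the pitfall:

q = Story.query(Story.wordCount > 900)
for s in q.fetch(5):
    print s.title, '/', s.wordCount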

In Django how can I run a custom clean function on fixture data during import and validation?

In a ModelForm I can write a clean_<field_name> member function to automatically validate and clean up data entered by a user, but what can I do about dirty JSON or CSV files (fixtures) during manage.py loaddata?
Fixtures loaded with loaddata are assumed to contain clean data that doesn't need validation (usually as the inverse of a prior dumpdata), so the short answer is that loaddata isn't the approach you want if you need to clean your inputs.
However, you can probably reuse some of the underpinnings of loaddata while implementing your custom data-cleaning code. I'm sure you could easily script something with the Django serialization libs that reads your existing data files in and then saves the resulting objects normally once the data has been cleaned up.
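For example, a rough sketch of that idea (the fixture path, the Entity model and its clean_zip_code() method are assumed names, not something given in the question):

from django.core import serializers

with open("fixtures/Entity.json") as fixture:
    for deserialized in serializers.deserialize("json", fixture):
        obj = deserialized.object
        obj.zip_code = obj.clean_zip_code(obj.zip_code)  # hypothetical cleaner on the model
        obj.save()  # normal save path, unlike loaddata's raw saves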
In case others want to do something similar, I defined a model method to do the cleaning (so it can also be called from ModelForms):
import re

MAX_ZIPCODE_DIGITS = 9
MIN_ZIPCODE_DIGITS = 5

def clean_zip_code(self, s=None):
    #s = str(s or self.zip_code)
    if not s:
        return None
    s = re.sub("\D", "", s)
    if len(s) > self.MAX_ZIPCODE_DIGITS:
        s = s[:self.MAX_ZIPCODE_DIGITS]
    if len(s) in (self.MIN_ZIPCODE_DIGITS - 1, self.MAX_ZIPCODE_DIGITS - 1):
        s = '0' + s  # FIXME: deal with other intermediate lengths
    if len(s) >= self.MAX_ZIPCODE_DIGITS:
        s = s[:self.MIN_ZIPCODE_DIGITS] + '-' + s[self.MIN_ZIPCODE_DIGITS:]
    return s
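For illustration, a rough sketch of how such a model-level cleaner might be reused from a ModelForm (EntityForm, myapp and Entity are assumed names, not part of the code above):

from django import forms
from myapp.models import Entity  # hypothetical app and model carrying clean_zip_code()

class EntityForm(forms.ModelForm):
    class Meta:
        model = Entity
        fields = ['zip_code']

    def clean_zip_code(self):
        # delegate to the model's cleaner so form and fixture cleanup share one code path
        return self.instance.clean_zip_code(self.cleaned_data.get('zip_code'))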
Then I wrote a standalone Python script to clean up my legacy JSON files using any clean_ methods found on the models.
import os, json

def clean_json(app='XYZapp', model='Entity', fields='zip_code', cleaner_prefix='clean_'):
    # Set the DJANGO_SETTINGS_MODULE environment variable.
    os.environ['DJANGO_SETTINGS_MODULE'] = app + ".settings"
    settings = __import__(app + '.settings').settings
    models = __import__(app + '.models').models
    fpath = os.path.join(settings.SITE_PROJECT_PATH, 'fixtures', model + '.json')
    if isinstance(fields, (str, unicode)):
        fields = [fields]
    Ns = []
    for field in fields:
        try:
            instance = getattr(models, model)()
        except AttributeError:
            print 'No model named %s could be found' % (model,)
            continue
        try:
            cleaner = getattr(instance, cleaner_prefix + field)
        except AttributeError:
            print 'No cleaner method named %s.%s could be found' % (model, cleaner_prefix + field)
            continue
        print 'Cleaning %s using %s.%s...' % (fpath, model, cleaner.__name__)
        fin = open(fpath, 'r')
        if fin:
            l = json.load(fin)
            before = len(l)
            cleans = 0
            for i in range(len(l)):
                if 'fields' in l[i] and field in l[i]['fields']:
                    l[i]['fields'][field] = cleaner(l[i]['fields'][field])  # cleaner returns None to delete records
                    cleans += 1
            fin.close()
            after = len(l)
            assert after > .5 * before
            Ns += [(before, after, cleans)]
            print 'Writing %d/%d (new/old) records after %d cleanups...' % Ns[-1]
            with open(fpath, 'w') as fout:
                fout.write(json.dumps(l, indent=2, sort_keys=True))
    return Ns

if __name__ == '__main__':
    clean_json()
