Sagemaker Hyperparameter Optimization XGBoost - amazon-sagemaker

I am trying to build a hyperparameter optimization job in Amazon Sagemaker, in python, but something is not working. Here is what I have:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.4xlarge',
                                    output_path=output_path_1,
                                    base_job_name='HPO-xgb',
                                    sagemaker_session=sess)
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, CategoricalParameter, ContinuousParameter

hyperparameter_ranges = {'eta': ContinuousParameter(0.01, 0.2),
                         'num_rounds': ContinuousParameter(100, 500),
                         'num_class': 4,
                         'max_depth': IntegerParameter(3, 9),
                         'gamma': IntegerParameter(0, 5),
                         'min_child_weight': IntegerParameter(2, 6),
                         'subsample': ContinuousParameter(0.5, 0.9),
                         'colsample_bytree': ContinuousParameter(0.5, 0.9)}

objective_metric_name = 'validation:mlogloss'
objective_type = 'minimize'
metric_definitions = [{'Name': 'validation-mlogloss',
                       'Regex': 'validation-mlogloss=([0-9\\.]+)'}]
tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            objective_type,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=9,
                            max_parallel_jobs=3)

tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})
And the error I get is:
AttributeError: 'str' object has no attribute 'keys'
The error seems to come from the tuner.py file:
----> 1 tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/tuner.py in fit(self, inputs, job_name, **kwargs)
144 self.estimator._prepare_for_training(job_name)
145
--> 146 self._prepare_for_training(job_name=job_name)
147 self.latest_tuning_job = _TuningJob.start_new(self, inputs)
148
~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/tuner.py in _prepare_for_training(self, job_name)
120
121 self.static_hyperparameters = {to_str(k): to_str(v) for (k, v) in self.estimator.hyperparameters().items()}
--> 122 for hyperparameter_name in self._hyperparameter_ranges.keys():
123 self.static_hyperparameters.pop(hyperparameter_name, None)
124
AttributeError: 'list' object has no attribute 'keys'

Your arguments when initializing the HyperparameterTuner object are in the wrong order. The constructor has the following signature:
HyperparameterTuner(estimator,
                    objective_metric_name,
                    hyperparameter_ranges,
                    metric_definitions=None,
                    strategy='Bayesian',
                    objective_type='Maximize',
                    max_jobs=1,
                    max_parallel_jobs=1,
                    tags=None,
                    base_tuning_job_name=None)
so in this case, your objective_type is in the wrong position. See the docs for more details.
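A sketch of a corrected call, passing everything by keyword so the positions cannot get mixed up (same variables as in the question; note that 'num_class' is a fixed value rather than a tunable range, so here it is assumed to belong on the estimator instead):
# Assumption: 'num_class' is meant to be static, so it is moved onto the
# estimator and removed from the tuning ranges.
xgb.set_hyperparameters(num_class=4)
hyperparameter_ranges.pop('num_class', None)

tuner = HyperparameterTuner(estimator=xgb,
                            objective_metric_name=objective_metric_name,
                            hyperparameter_ranges=hyperparameter_ranges,
                            metric_definitions=metric_definitions,
                            objective_type='Minimize',
                            max_jobs=9,
                            max_parallel_jobs=3)
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})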

Related

Tensorflow: convert PrefetchDataset to BatchDataset
With the latest TensorFlow version 2.3.1, I am trying to follow the basic text classification example at https://www.tensorflow.org/tutorials/keras/text_classification. Instead of creating the dataset from a directory as the example does, I am using a csv file:
SELECT_COLUMNS = ['SentimentText','Sentiment']
LABEL_COLUMN = 'Sentiment'
LABELS = [0, 1]

def get_dataset(file_path, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=3,  # Artificially small to make examples easier to show.
        label_name=LABEL_COLUMN,
        na_value="?",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset

all_data = get_dataset(data_path, select_columns=SELECT_COLUMNS)
As a result I get:
type(all_data)
tensorflow.python.data.ops.dataset_ops.PrefetchDataset
The example loads data from a directory with:
batch_size = 32
seed = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)
and gets a dataset of another type:
type(raw_train_ds)
tensorflow.python.data.ops.dataset_ops.BatchDataset
Now when I try to standardise and vectorise the data with the functions from the example:
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation),
                                    '')

max_features = 10000
sequence_length = 250

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length)
and apply them to my dataset, I get an error:
# Make a text-only dataset (without labels), then call adapt
train_text = all_data.map(lambda x, y: x)
vectorize_layer.adapt(train_text)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-20-1f1fc445912d> in <module>
1 # Make a text-only dataset (without labels), then call adapt
2 train_text = all_data.map(lambda x, y: x)
----> 3 vectorize_layer.adapt(train_text)
/opt/anaconda3/lib/python3.7/site-packages/tensorflow/python/keras/layers/preprocessing/text_vectorization.py in adapt(self, data, reset_state)
378 shape = dataset_ops.get_legacy_output_shapes(data)
379 if not isinstance(shape, tensor_shape.TensorShape):
--> 380 raise ValueError("The dataset passed to 'adapt' must contain a single "
381 "tensor value.")
382 if shape.rank == 0:
ValueError: The dataset passed to 'adapt' must contain a single tensor value.
How can I convert a PrefetchDataset to a BatchDataset?
You could use the tf.stack method to pack the features into a single tensor. The function below is from the Custom training: walkthrough tutorial in TensorFlow.
def pack_features_vector(features, labels):
    features = tf.stack(list(features.values()), axis=1)
    return features, labels

all_data = get_dataset(data_path, select_columns=SELECT_COLUMNS)
train_dataset = all_data.map(pack_features_vector)
train_text = train_dataset.map(lambda x, y: x)
vectorize_layer.adapt(train_text)
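The underlying issue is not the dataset's class name (PrefetchDataset vs. BatchDataset) but its element structure: make_csv_dataset yields an (OrderedDict of column tensors, label) pair, so adapt sees a dict rather than a single tensor. Since only one text column is selected here, a minimal alternative sketch (assuming the 'SentimentText' column from the question) is to extract that column by name:
# Alternative sketch: pull the single text column out of the feature dict,
# giving adapt() the single-tensor dataset it expects.
train_text = all_data.map(lambda features, label: features['SentimentText'])
vectorize_layer.adapt(train_text)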

Complex number in Fenics

I am currently trying to solve a complex-valued PDE with FEniCS in a Jupyter notebook, but I am having trouble when I try to use a complex number in FEniCS.
Here is how I've defined the variational problem:
u = TrialFunction(V)
v = TestFunction(V)
a = (inner(grad(u[0]), grad(v[0])) + inner(grad(u[1]), grad(v[1])))*dx \
    + sin(lat)*(u[0]*v[1] - u[1]*v[0])*dx \
    + 1j*((-inner(grad(u[0]), grad(v[1])) + inner(grad(u[1]), grad(v[0])))*dx
          + (sin(lat)*(u[0]*v[0] - u[1]*v[1])*dx))
f = Constant((1.0, 1.0))
b = (v[0]*f[0] + f[1]*v[1])*ds + 1j*((f[1]*v[0] - f[0]*v[1])*ds)
I got the following error message:
AttributeError Traceback (most recent call last)
<ipython-input-74-7760afa5a395> in <module>()
1 u = TrialFunction(V)
2 v = TestFunction(V)
----> 3 a = (inner(grad(u[0]), grad(v[0])) + inner(grad(u[1]), grad(v[1])))*dx + sin(lat)*(u[0]*v[1]-u[1]*v[0])*dx+1j*((-inner(grad(u[0]), grad(v[1])) + inner(grad(u[1]), grad(v[0])))*dx + (sin(lat)*(u[0]*v[0]-u[1]*v[1])*dx)
4 f = Constant((0.0,0.0))
5 b = (v[0]*f[0]+f[1]*v[1])*ds+1j*((f[1]*v[0]-f[0]*v[1])*ds)
~/anaconda3_420/lib/python3.5/site-packages/ufl/form.py in __rmul__(self, scalar)
305 "Multiply all integrals in form with constant scalar value."
306 # This enables the handy "0*form" or "dt*form" syntax
--> 307 if is_scalar_constant_expression(scalar):
308 return Form([scalar*itg for itg in self.integrals()])
309 return NotImplemented
~/anaconda3_420/lib/python3.5/site-packages/ufl/checks.py in is_scalar_constant_expression(expr)
84 if is_python_scalar(expr):
85 return True
---> 86 if expr.ufl_shape:
87 return False
88 return is_globally_constant(expr)
AttributeError: 'complex' object has no attribute 'ufl_shape'
Could someone please help me?
By the way, FEniCS might not be the best tool for solving complex-valued PDEs, and I would welcome suggestions for handling such problems.
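For context, a hedged sketch rather than a definitive fix: legacy FEniCS assembles real-valued UFL forms only, so multiplying a form by the Python complex 1j is exactly what fails in __rmul__ in the traceback above. Since u and v already carry the real and imaginary parts as components, one common workaround is to keep the entire variational form real, folding the would-be imaginary terms in as additional real terms (the exact grouping depends on the PDE being solved):
# Sketch, assuming u[0], u[1] and v[0], v[1] hold real and imaginary parts:
# assemble one real-valued form instead of scaling terms by 1j.
a = (inner(grad(u[0]), grad(v[0])) + inner(grad(u[1]), grad(v[1])))*dx \
    + sin(lat)*(u[0]*v[1] - u[1]*v[0])*dx \
    + (-inner(grad(u[0]), grad(v[1])) + inner(grad(u[1]), grad(v[0])))*dx \
    + sin(lat)*(u[0]*v[0] - u[1]*v[1])*dx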

NoSuchEntityException: An error occurred (NoSuchEntity) when calling the GetRole operation: The user with name <name> cannot be found

A call to get_execution_role() from a notebook instance fails with the error message NoSuchEntityException: An error occurred (NoSuchEntity) when calling the GetRole operation: The user with name <name> cannot be found.
Stack trace:
NoSuchEntityExceptionTraceback (most recent call last)
<ipython-input-1-1e2d3f162cfe> in <module>()
5 sagemaker_session = sagemaker.Session()
6
----> 7 role = get_execution_role()
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/session.pyc in get_execution_role(sagemaker_session)
871 if not sagemaker_session:
872 sagemaker_session = Session()
--> 873 arn = sagemaker_session.get_caller_identity_arn()
874
875 if 'role' in arn:
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/session.pyc in get_caller_identity_arn(self)
701 # Call IAM to get the role's path
702 role_name = role[role.rfind('/') + 1:]
--> 703 role = self.boto_session.client('iam').get_role(RoleName=role_name)['Role']['Arn']
704
705 return role
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _api_call(self, *args, **kwargs)
312 "%s() only accepts keyword arguments." % py_operation_name)
313 # The "self" in this scope is referring to the BaseClient.
--> 314 return self._make_api_call(operation_name, kwargs)
315
316 _api_call.__name__ = str(py_operation_name)
/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _make_api_call(self, operation_name, api_params)
610 error_code = parsed_response.get("Error", {}).get("Code")
611 error_class = self.exceptions.from_code(error_code)
--> 612 raise error_class(parsed_response, operation_name)
613 else:
614 return parsed_response
NoSuchEntityException: An error occurred (NoSuchEntity) when calling the GetRole operation: The user with name <name> cannot be found.
However, using the boto client directly to get info about the role succeeds. This works fine:
response = client.get_role(
    RoleName='role-name',
)['Role']['Arn']
Turns out this is some weird bug that goes away if you stop and start the notebook instance.
I shut the notebook down and ran it again, and it works. P.S.: I had to run the code again for it to take effect.
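Until the instance is restarted, the two observations above can be combined in code. A hedged sketch (the role name 'role-name' is a placeholder, not something from the original setup):
import boto3
from botocore.exceptions import ClientError
from sagemaker import get_execution_role

# Sketch: fall back to looking the role up directly through IAM when
# get_execution_role() hits the NoSuchEntity error.
try:
    role = get_execution_role()
except ClientError:
    role = boto3.client('iam').get_role(RoleName='role-name')['Role']['Arn']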

TypeError: ufunc 'add' did not contain a loop

I use Anaconda and gdsCAD, and I get an error even though all packages are installed correctly, as explained here: http://pythonhosted.org/gdsCAD/
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
My imports look like this (in the end I imported everything):
import numpy as np
from gdsCAD import *
import matplotlib.pyplot as plt
My example code looks like this:
something = core.Elements()
box=shapes.Box( (5,5),(1,5),0.5)
core.default_layer = 1
core.default_colors = 2
something.add(box)
something.show()
My error message looks like this:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-2f90b960c1c1> in <module>()
31 puffer_wafer = shapes.Circle((0.,0.), puffer_wafer_radius, puffer_line_thickness)
32 bp.add(puffer_wafer)
---> 33 bp.show()
34 wafer = shapes.Circle((0.,0.), wafer_radius, wafer_line_thickness)
35 bp.add(wafer)
C:\Users\rpilz\AppData\Local\Continuum\Anaconda2\lib\site-packages\gdscad-0.4.5-py2.7.egg\gdsCAD\core.pyc in _show(self)
80 ax.margins(0.1)
81
---> 82 artists=self.artist()
83 for a in artists:
84 a.set_transform(a.get_transform() + ax.transData)
C:\Users\rpilz\AppData\Local\Continuum\Anaconda2\lib\site-packages\gdscad-0.4.5-py2.7.egg\gdsCAD\core.pyc in artist(self, color)
952 art=[]
953 for p in self:
--> 954 art+=p.artist()
955 return art
956
C:\Users\rpilz\AppData\Local\Continuum\Anaconda2\lib\site-packages\gdscad-0.4.5-py2.7.egg\gdsCAD\core.pyc in artist(self, color)
475 poly = lines.buffer(self.width/2.)
476
--> 477 return [descartes.PolygonPatch(poly, lw=0, **self._layer_properties(self.layer))]
478
479
C:\Users\rpilz\AppData\Local\Continuum\Anaconda2\lib\site-packages\gdscad-0.4.5-py2.7.egg\gdsCAD\core.pyc in _layer_properties(layer)
103 # Default colors from previous versions
104 colors = ['k', 'r', 'g', 'b', 'c', 'm', 'y']
--> 105 colors += matplotlib.cm.gist_ncar(np.linspace(0.98, 0, 15))
106 color = colors[layer % len(colors)]
107 return {'color': color}
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
gdsCAD has been a pain, from the shapely install to this plotting issue.

This issue is caused by the wrong datatype being passed to the colors function. It can be solved by editing the following line in core.py:
colors += matplotlib.cm.gist_ncar(np.linspace(0.98, 0, 15))
to
colors += list(matplotlib.cm.gist_ncar(np.linspace(0.98, 0, 15)))
If you don't know where core.py is located, just type:
from gdsCAD import *
core
This will print the path of the core.py file. Good luck!
Well first, I'd ask that you please include the actual code, as the 'example code' in the question is obviously different from what produced the traceback. When debugging, the details matter, and I need to be able to actually run the code.
You obviously have a data type problem. Chances are pretty good it's in the variables here:
puffer_wafer = shapes.Circle((0.,0.), puffer_wafer_radius, puffer_line_thickness)
I had the same error thrown when I was running a call to Pandas. I changed the data to str(data) and the code worked.
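A minimal sketch of why str() helps: a NumPy scalar on either side of '+' turns the whole operation into a NumPy ufunc, and 'add' has no implementation for the string dtypes both operands get cast to (the exact message varies by NumPy version):
import numpy as np

x = np.float64(1.5)      # e.g. one element of a predictions array
# x + ","                # TypeError: ufunc 'add' did not contain a loop ...
result = str(x) + ","    # converting to a Python str first concatenates fine
print(result)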
I don't know if this helps, as I am fairly new to this myself, but I had a similar error and found that it is due to a type-casting issue, as the previous answer suggests. I can't see from the example in the question exactly what you are trying to do. Below is a small example of my issue and solution. My code makes a simple Random Forest model using scikit-learn.
Here is an example that will give the error; it is caused by the third-to-last line, which concatenates the results to write to file.
import scipy
import math
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing, metrics, cross_validation

Data = pd.read_csv("Free_Energy_exp.csv", sep=",")
Data = Data.fillna(Data.mean())  # replace the NA values with the mean of the descriptor
header = Data.columns.values  # Use the column headers as the descriptor labels
Data.head()

test_name = "Test.csv"

npArray = np.array(Data)
print header.shape
npheader = np.array(header[1:-1])
print("Array shape X = %d, Y = %d " % (npArray.shape))
datax, datay = npArray.shape
names = npArray[:,0]
X = npArray[:,1:-1].astype(float)
y = npArray[:,-1].astype(float)
X = preprocessing.scale(X)

XTrain, XTest, yTrain, yTest = cross_validation.train_test_split(X, y, random_state=0)

# Predictions results initialised
RFpredictions = []

RF = RandomForestRegressor(n_estimators=10, max_features=5, max_depth=5, random_state=0)
RF.fit(XTrain, yTrain)  # Train the model
print("Training R2 = %5.2f" % RF.score(XTrain, yTrain))
RFpreds = RF.predict(XTest)

with open(test_name, 'a') as fpred:
    lenpredictions = len(RFpreds)
    lentrue = yTest.shape[0]
    if lenpredictions == lentrue:
        fpred.write("Names/Label,, Prediction Random Forest,, True Value,\n")
        for i in range(0, lenpredictions):
            fpred.write(RFpreds[i]+",,"+yTest[i]+",\n")
    else:
        print "ERROR - names, prediction and true value array size mismatch."
This leads to the following error:
Traceback (most recent call last):
File "min_example.py", line 40, in <module>
fpred.write(RFpreds[i]+",,"+yTest[i]+",\n")
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('S32') dtype('S32') dtype('S32')
The solution is to wrap each variable in str() on the third-to-last line and then write to file. No other changes to the code have been made from the above.
import scipy
import math
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn import preprocessing, metrics, cross_validation

Data = pd.read_csv("Free_Energy_exp.csv", sep=",")
Data = Data.fillna(Data.mean())  # replace the NA values with the mean of the descriptor
header = Data.columns.values  # Use the column headers as the descriptor labels
Data.head()

test_name = "Test.csv"

npArray = np.array(Data)
print header.shape
npheader = np.array(header[1:-1])
print("Array shape X = %d, Y = %d " % (npArray.shape))
datax, datay = npArray.shape
names = npArray[:,0]
X = npArray[:,1:-1].astype(float)
y = npArray[:,-1].astype(float)
X = preprocessing.scale(X)

XTrain, XTest, yTrain, yTest = cross_validation.train_test_split(X, y, random_state=0)

# Predictions results initialised
RFpredictions = []

RF = RandomForestRegressor(n_estimators=10, max_features=5, max_depth=5, random_state=0)
RF.fit(XTrain, yTrain)  # Train the model
print("Training R2 = %5.2f" % RF.score(XTrain, yTrain))
RFpreds = RF.predict(XTest)

with open(test_name, 'a') as fpred:
    lenpredictions = len(RFpreds)
    lentrue = yTest.shape[0]
    if lenpredictions == lentrue:
        fpred.write("Names/Label,, Prediction Random Forest,, True Value,\n")
        for i in range(0, lenpredictions):
            fpred.write(str(RFpreds[i])+",,"+str(yTest[i])+",\n")
    else:
        print "ERROR - names, prediction and true value array size mismatch."
These examples are from a larger piece of code, so I hope they are clear enough.

Django/pyodbc error: not enough arguments for format string

I have a Dictionary model defined in Django (1.6.5). One method, get_topentities, returns the top names in my dictionary (entity names are defined by the Entity model):
def get_topentities(self, n):
    entities = self.entity_set.select_related().filter(in_dico=True, table_type=0).order_by("rank")[0:n]
    return entities
When I call the function (say with n=2), it returns the top 2 elements, but I cannot access the second one because of this "not enough arguments for format string" error:
In [5]: d = Dictionary.objects.get(code='USA')
In [6]: top2 = d.get_topentities(2)
In [7]: top2
Out[7]: [<Entity: BARACK OBAMA>, <Entity: GOVERNMENT>]
In [8]: top2[0]
Out[8]: <Entity: BARACK OBAMA>
In [9]: top2[1]
.
.
/usr/local/lib/python2.7/dist-packages/django_pyodbc/compiler.pyc in as_sql(self, with_limits, with_col_aliases)
172 # Lop off ORDER... and the initial "SELECT"
173 inner_select = _remove_order_limit_offset(raw_sql)
--> 174 outer_fields, inner_select = self._alias_columns(inner_select)
175
176 order = _get_order_limit_offset(raw_sql)[0]
/usr/local/lib/python2.7/dist-packages/django_pyodbc/compiler.pyc in _alias_columns(self, sql)
339
340 # store the expanded paren string
--> 341 parens[key] = buf% parens
342 #cannot use {} because IBM's DB2 uses {} as quotes
343 paren_buf[paren_depth] += '(%(' + key + ')s)'
TypeError: not enough arguments for format string
In [10]:
My server backend is MSSQL and I'm using pyodbc as the database driver. If I try the same on a PC with the sqlserver_ado engine, it works. Can someone help?
Regards,
Patrick
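For what it's worth, the TypeError itself is plain Python %-formatting: the traceback shows django_pyodbc building the query with buf % parens, where parens is a dict. If the buffered SQL still contains bare %s parameter placeholders alongside the %(key)s entries the backend inserts, the dict is consumed as a single positional argument and the formatting runs out of arguments. A minimal sketch of that behaviour (the SQL string is illustrative only, not the query Django generated):
# Illustrative only: with a dict on the right-hand side, the first bare '%s'
# consumes the whole dict as one value and the second one has nothing left.
buf = "SELECT a FROM t WHERE x = %s AND y = %s AND z = %(p0)s"
parens = {'p0': '(subquery)'}
try:
    buf % parens
except TypeError as e:
    print(e)  # not enough arguments for format string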
