I have a question on SageMaker multi-GPU - IHAC (I have a customer) running their code on single-GPU instances (ml.p3.2xlarge), but when they select ml.p3.8xlarge (multi-GPU), it runs into the following error:
“Failure reason: No objective metrics found after running 5 training jobs. Please ensure that the custom algorithm is emitting the objective metric as defined by the regular expression provided.”
Their code handles multi-GPU usage and currently works well on their machine outside of AWS. Do you have any documentation that you can point me to, to help them address the problem? They are currently using PyTorch for all of their model development.
Looks like they are running Hyperparameter Optimization (HPO) on SageMaker, and no metric is being emitted by their code that allows HPO to tune. It is a problem with how they specify the regular-expression objective metric; for more details, see SageMaker Estimator Metrics Definitions.
Essentially, use a tool like https://regex101.com to validate that the regex they use actually extracts the objective value from their training logs.
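For example, if the training script prints a line like "validation loss: 0.1234", a metric definition along these lines would let HPO pick the value up (a hedged sketch only; the metric name, log format, hyperparameter range, and estimator variable are illustrative, not the customer's actual values):

from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

tuner = HyperparameterTuner(
    estimator=pytorch_estimator,  # an already-configured estimator (name assumed)
    objective_metric_name="validation:loss",
    objective_type="Minimize",
    hyperparameter_ranges={"lr": ContinuousParameter(1e-5, 1e-2)},
    metric_definitions=[{
        "Name": "validation:loss",
        # Must match a line the training script actually prints,
        # e.g. "validation loss: 0.1234"
        "Regex": r"validation loss: ([0-9\.]+)",
    }],
)
tuner.fit({"training": "s3://<bucket>/<prefix>"})  # placeholder S3 URI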
Related
For anyone experienced with fastai’s distributed training (either within SageMaker or outside it):
Are there any material benefits to using it over PyTorch DDP (which it’s built on top of)?
What would be the easiest way to incorporate this inside a SM training job?
It requires the training script to be run like python -m fastai.launch scriptname.py ...args..., so using Script Mode is not immediately straightforward. Pointing to a .sh file as the entry_point in the PyTorch estimator means that SageMaker will not pip install from a provided requirements.txt, so the user must handle all of this inside their own wrapper script (a sketch of one possible wrapper follows below).
Can SM Distributed Data Parallel be used with the fastai distributed training? Or do we need to utilize PyTorch DDP instead of fastai in order to use SM DDP?
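One possible workaround for the launch issue in the second question: use a small Python entry point that installs dependencies itself and then delegates to fastai.launch (an untested sketch; wrapper.py, train.py, and the file layout are assumptions, not a documented pattern):

# wrapper.py - hypothetical Script Mode entry point: install requirements,
# then hand off to fastai's launcher (train.py is the real training script).
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
subprocess.check_call([sys.executable, "-m", "fastai.launch", "train.py"] + sys.argv[1:])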
We’re currently running PyTorch Lightning for training outside of SageMaker. We're looking to use SageMaker to leverage distributed training, checkpointing, training optimization (Training Compiler), etc., to accelerate the training process and save costs. What's the recommended way to migrate their PyTorch Lightning scripts to run on SageMaker?
The easiest way to run PyTorch Lightning on SageMaker is to use the SageMaker PyTorch estimator (example) to get started. Ideally you will add a requirements.txt for installing PyTorch Lightning along with your source code.
Regarding distributed training, Amazon SageMaker recently launched native support for running PyTorch Lightning based distributed training. Please follow the links below to set up your training code:
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt-lightning.html
https://aws.amazon.com/blogs/machine-learning/run-pytorch-lightning-and-native-pytorch-ddp-on-amazon-sagemaker-training-featuring-amazon-search/
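For reference, enabling the SageMaker data parallel library on the estimator looks roughly like this (a sketch, not verified against your setup; the entry point and role ARN are placeholders, and SMDDP requires a supported multi-GPU instance type such as ml.p3.16xlarge):

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",           # your Lightning training script
    role="<execution-role-arn>",      # placeholder
    instance_type="ml.p3.16xlarge",   # must be an SMDDP-supported type
    instance_count=2,
    framework_version="1.12",
    py_version="py38",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)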
There's no big difference between running PyTorch Lightning scripts and plain PyTorch scripts with SageMaker.
One caveat, however, when running distributed training jobs with DDPPlugin, is to set the NODE_RANK environment variable properly at the beginning of the script, because PyTorch Lightning knows nothing about SageMaker's environment variables and relies on generic cluster variables:
os.environ["NODE_RANK"] = str(int(os.environ.get("CURRENT_HOST", "algo-1")[5:]) - 1)
or (more robust):
import json, os

# Node rank = this host's index in the cluster's host list.
rc = json.loads(os.environ.get("SM_RESOURCE_CONFIG", "{}"))
os.environ["NODE_RANK"] = str(rc["hosts"].index(rc["current_host"]))
Since your question is specific to migrating already-working code into SageMaker, using the link here as a reference, I can try to break the process into 3 parts:
Create a PyTorch Estimator - estimator
import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()

pytorch_estimator = PyTorch(
    entry_point='my_model.py',
    instance_type='ml.g4dn.16xlarge',
    instance_count=1,
    framework_version='1.7',
    py_version='py3',
    output_path=<< s3 bucket >>,
    source_dir=<< path for my_model.py >>,
    sagemaker_session=sagemaker_session,
)
entry_point = "my_model.py" - this part should be your existing Pytorch Lightning script. In the main method you can have something like this:
if __name__ == '__main__':
    import pytorch_lightning as pl

    model = MyLightningModule()  # your existing LightningModule
    trainer = pl.Trainer(
        devices=-1,              # in order to utilize all GPUs
        accelerator="gpu",
        strategy="ddp",
        enable_checkpointing=True,
        default_root_dir="/opt/ml/checkpoints",
    )
    trainer.fit(model)
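Note that estimator.fit() is what actually launches the training job, and it is called from the notebook that created the estimator, not from inside my_model.py (a minimal sketch; the channel name and S3 path are placeholders):

# Launch the SageMaker training job defined by the estimator above.
pytorch_estimator.fit({'training': 's3://<bucket>/<training-data-prefix>'})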
Also, the link here explains the coding process very well.
https://vision.unipv.it/events/Bianchi_giu2021-Introduction-PyTorch-Lightning.pdf
I'd like to (a) plot SHAP values and (b) do so from the SageMaker AutoML pipeline. To achieve (a), Debugger should be used, according to: https://aws.amazon.com/blogs/machine-learning/ml-explainability-with-amazon-sagemaker-debugger/.
But how do I enable debug mode in Autopilot without hacking into the background?
SageMaker Autopilot doesn't currently support SageMaker Debugger out of the box (as of Dec 2020). You can hack the underlying Hyperparameter Tuning job to pass in a debug parameter.
However, there is a way to use SHAP with Autopilot models. Take a look at this blog post explaining how to use SHAP with SageMaker Autopilot: https://aws.amazon.com/blogs/machine-learning/explaining-amazon-sagemaker-autopilot-models-with-shap/.
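At a high level, the approach in that post boils down to wrapping the deployed Autopilot endpoint in a prediction function and handing it to SHAP's KernelExplainer, roughly like this (an unverified sketch; the endpoint name and the data variables are placeholders, so follow the blog post for the real details):

import io
import boto3
import numpy as np
import shap

runtime = boto3.client("sagemaker-runtime")

def predict(X):
    # Serialize a batch of rows to CSV, invoke the Autopilot endpoint,
    # and parse the returned predictions.
    body = "\n".join(",".join(map(str, row)) for row in X)
    resp = runtime.invoke_endpoint(
        EndpointName="<autopilot-endpoint>",  # placeholder
        ContentType="text/csv",
        Body=body,
    )
    return np.loadtxt(io.BytesIO(resp["Body"].read()), delimiter=",")

# background_sample / X_test: small, representative slices of your dataset.
explainer = shap.KernelExplainer(predict, background_sample)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)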
Wondering if anyone has a full working example of how to make a ZCL endpoint using Digi's XBee ANSI C Library?
The samples directory in that repo has some things; the commissioning server sample is helpful, but I'd love to see an example of the library actually being used for something real.
What I'm trying to make here is a simple sensor to interface with an existing Zigbee network (the coordinator being zigbee2mqtt with a CC2531 in my case) to report readings to Home Assistant.
I've seen mentions of an "xbee custom endpoint" example on the Digi forum, but I couldn't find that example; it sounds like that'd be exactly what I need.
Thanks
The Commissioning Client and Server samples are overkill for just getting started, but they are used for "something real". The Commissioning Cluster is a part of the Zigbee spec.
You might want to look at zcl_comm_startup_attributes and zcl_comm_startup_attribute_tree in src/zigbee/zcl_commissioning.c to see how you can set up an attribute tree for your cluster.
Perhaps look at include/zigbee/zcl_basic_attributes.h and samples/common/_zigbee_walker.c on how to set up the endpoint table with a Basic cluster and its attributes. The Zigbee Walker sample shows how to use ZDO/ZDP queries to enumerate endpoints, and then ZCL queries to enumerate clusters and attributes. You can use that sample to validate the endpoint/cluster/attribute table that you've set up in a particular program.
You might want to spend some time reading through the Zigbee Cluster Library specification to understand the concept of endpoints, clusters and attributes, which may help you to understand the tables you need to set up in your program to implement them.
I am trying out Amazon SageMaker, and I haven't figured out how we can have continuous training.
For example, if I have a CSV file in S3, I want to train each time the CSV file is updated.
I know we can go back to the notebook and re-run the whole notebook to make this happen.
But I am looking for an automated way, with some Python scripts, or using a Lambda function with S3 events, etc.
You can use the boto3 SDK for Python to start training from Lambda; then you need to trigger the Lambda when the CSV is updated.
http://boto3.readthedocs.io/en/latest/reference/services/sagemaker.html
Example Python code:
https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model-create-training-job.html
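Putting the two together, a Lambda handler along these lines could kick off a training job on each S3 update (an illustrative sketch only; the job settings, image URI, role ARN, and S3 paths are placeholders to replace with your own):

import time
import boto3

sm = boto3.client("sagemaker")

def handler(event, context):
    # The S3 event record identifies the bucket/object that changed.
    s3 = event["Records"][0]["s3"]
    data_uri = f"s3://{s3['bucket']['name']}/{s3['object']['key']}"

    sm.create_training_job(
        TrainingJobName=f"retrain-{int(time.time())}",
        RoleArn="<sagemaker-execution-role-arn>",      # placeholder
        AlgorithmSpecification={
            "TrainingImage": "<training-image-uri>",   # placeholder
            "TrainingInputMode": "File",
        },
        InputDataConfig=[{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": data_uri,
            }},
        }],
        OutputDataConfig={"S3OutputPath": "s3://<bucket>/output"},  # placeholder
        ResourceConfig={
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )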
Addition: you don't need to use Lambda; you can just run/cron the Python script on any kind of instance that has Python and the AWS SDK on it.
There are a couple of examples of how to accomplish this in the aws-samples GitHub organization.
The serverless-sagemaker-orchestration example sounds most similar to the use case you are describing. This example walks you through how to continuously train a SageMaker linear regression model for housing price predictions on new CSV data that is added daily to an S3 bucket, using the built-in LinearLearner algorithm, orchestrated with Amazon CloudWatch Events, AWS Step Functions, and AWS Lambda.
There is also the similar aws-sagemaker-build example, but it might be more difficult to follow currently if you are looking for detailed instructions.
Hope this helps!