PyTorch Lightning with Amazon SageMaker - amazon-sagemaker

We’re currently running PyTorch Lightning for training outside of SageMaker. We’re looking to use SageMaker to leverage distributed training, checkpointing, model training optimization (Training Compiler), etc., to accelerate the training process and save costs. What’s the recommended way to migrate our PyTorch Lightning scripts to run on SageMaker?

The easiest way to run PyTorch Lightning on SageMaker is to get started with the SageMaker PyTorch estimator (example). Ideally you will add a requirements.txt for installing PyTorch Lightning along with your source code.
Regarding distributed training, Amazon SageMaker recently launched native support for running PyTorch Lightning based distributed training. Please follow the links below to set up your training code:
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt-lightning.html
https://aws.amazon.com/blogs/machine-learning/run-pytorch-lightning-and-native-pytorch-ddp-on-amazon-sagemaker-training-featuring-amazon-search/
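If you adopt the SageMaker data parallel library described in the first link, the estimator-side change is essentially just the distribution argument; the Lightning-side changes are covered in that doc. A minimal sketch, assuming an SMDDP-supported instance type (the script name, instance count, and framework versions below are placeholders, not a prescription):
from sagemaker.pytorch import PyTorch
import sagemaker

estimator = PyTorch(
    entry_point="train.py",              # your PyTorch Lightning script
    role=sagemaker.get_execution_role(),
    instance_type="ml.p4d.24xlarge",     # SMDDP requires p3/p4-class instances
    instance_count=2,
    framework_version="1.12",
    py_version="py38",
    # turn on the SageMaker distributed data parallel library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)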

There's no big difference between running PyTorch Lightning and plain PyTorch scripts with SageMaker.
One caveat, however, when running distributed training jobs with DDPPlugin, is to properly set the NODE_RANK environment variable at the beginning of the script, because PyTorch Lightning knows nothing about SageMaker environment variables and relies on generic cluster variables:
import os
os.environ["NODE_RANK"] = str(int(os.environ.get("SM_CURRENT_HOST", "algo-1")[5:]) - 1)
or (more robust):
import json
import os
# the node rank is the index of the current host in the cluster's host list
rc = json.loads(os.environ.get("SM_RESOURCE_CONFIG", "{}"))
os.environ["NODE_RANK"] = str(rc["hosts"].index(rc["current_host"]))

Since your question is specific to migrating already-working code into SageMaker, using the link here as a reference, I can try to break the process into three parts:
Create a PyTorch estimator - estimator:
import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
pytorch_estimator = PyTorch(
    entry_point='my_model.py',
    role=sagemaker.get_execution_role(),   # IAM role the training job assumes
    instance_type='ml.g4dn.16xlarge',
    instance_count=1,
    framework_version='1.7',
    py_version='py3',
    output_path='<< s3 bucket >>',
    source_dir='<< path for my_model.py >>',
    sagemaker_session=sagemaker_session,
)
entry_point = "my_model.py" - this part should be your existing Pytorch Lightning script. In the main method you can have something like this:
if __name__ == '__main__':
import pytorch_lightning as pl
trainer = pl.Trainer(
devices=-1, ## in order to utilize all GPUs
accelerator="gpu",
strategy="ddp",
enable_checkpointing=True,
default_root_dir="/opt/ml/checkpoints",
)
model=estimator.fit()
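With the estimator and the entry point in place, the job is launched from your notebook with a single fit call. A minimal sketch (the S3 path is a placeholder); if you also pass checkpoint_s3_uri to the estimator, SageMaker keeps /opt/ml/checkpoints (the default_root_dir above) synced to S3, so Lightning checkpoints outlive the training instances:
# 'training' becomes the SM_CHANNEL_TRAINING input channel inside the container
pytorch_estimator.fit({'training': 's3://<< your bucket >>/train-data/'})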
Also, the link here explains the coding process very well:
https://vision.unipv.it/events/Bianchi_giu2021-Introduction-PyTorch-Lightning.pdf

Related

fastai distributed training in SageMaker?

For anyone with experience with fastai’s distributed training (either within SageMaker or outside it):
Are there any material benefits to using it over PyTorch DDP (which it’s built on top of)?
What would be the easiest way to incorporate this inside a SM training job?
It requires the training script to be run like python -m fastai.launch scriptname.py ...args..., so using Script Mode is not immediately straightforward. Pointing to a .sh file for the entry_point in the PyTorch estimator means that SageMaker will not pip install from a provided requirements.txt, so the user must control all of this inside their bash script.
Can SM Distributed Data Parallel be used with fastai's distributed training? Or do we need to use PyTorch DDP instead of fastai in order to use SM DDP?
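One possible pattern for the launcher problem above, purely as an untested sketch: keep a normal Python entry_point (so requirements.txt handling still works) and have the script re-execute itself under fastai.launch on first start. The FASTAI_LAUNCHED flag and the train() body are illustrative placeholders, not fastai or SageMaker APIs:
import os
import subprocess
import sys

def train():
    # your existing fastai distributed training code goes here
    ...

if __name__ == "__main__":
    if os.environ.get("FASTAI_LAUNCHED") != "1":
        # first invocation by SageMaker Script Mode: re-launch this same file
        # under fastai's multi-process launcher, then exit
        env = dict(os.environ, FASTAI_LAUNCHED="1")
        subprocess.check_call(
            [sys.executable, "-m", "fastai.launch", __file__] + sys.argv[1:],
            env=env,
        )
        sys.exit(0)
    train()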

Migrating from Google App Engine Mapreduce to Apache Beam

I have been a long-time user of Google App Engine's Mapreduce library for processing data in the Google Datastore. Google no longer supports it, and it doesn't work at all in Python 3. I'm trying to migrate our older Mapreduce jobs to Google's Dataflow / Apache Beam runner, but the official documentation is awful: it just describes Apache Beam and does not tell you how to migrate.
In particular, the issues are these:
In Mapreduce, the jobs would run on your existing deployed application. However, in Beam you have to create and deploy a custom Docker image to build the environment for Dataflow, is this right?
To create a new job template in Mapreduce, you just need to edit a YAML file and deploy it. To create one in Apache Beam, you need to create custom runner code, a template file deployed to Google Cloud Storage, and link it up with the Docker image, is this right?
Is the above accurate? If so, is it generally the case that working with Dataflow is much more difficult than Mapreduce? Are there any libraries or tips for making this easier?
In technical terms that's what is happening, but unless you have some specific advanced use cases, you won't need to set up any custom Docker images manually. Dataflow does some work in the background to package your user code and dependencies into a container so that it can execute them on its worker VMs.
In Dataflow, writing a job template mainly requires writing some pipeline code in your chosen language (Java or Python), and possibly writing some metadata. Once your code is written, creating and staging the template itself isn't much different than running a normal Dataflow job. There's a page documenting the process.
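To give a feel for what "writing some pipeline code" means in practice, here is a minimal hedged sketch of a Python Beam word-count pipeline that could run on Dataflow; the bucket paths are placeholders, and real jobs would also set project, region, and temp_location in the options:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# placeholder options; add project, region, temp_location, etc. for Dataflow
options = PipelineOptions(runner="DataflowRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://<your-bucket>/input/*.txt")
        | "SplitWords" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "SumPerWord" >> beam.CombinePerKey(sum)
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Write" >> beam.io.WriteToText("gs://<your-bucket>/output/counts")
    )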
I agree the page on Mapreduce to Beam migration is very sparse and unhelpful, although I think I understand why that is. Migrating from Mapreduce to Beam isn't a straightforward 1:1 migration where only the syntax changes. It's a different pipeline model and most likely will require some level of rewriting your code for the migration. A migration guide that fully covered everything would end up repeating most of the existing documentation.
Since it sounds like most of your questions are around setting up and executing Beam pipelines, I encourage you to begin with the Dataflow quickstart in your chosen language. It won't teach you how to write pipelines, but will teach you how to set up your environment to write and run pipelines. There are links in the quickstarts which direct you to Apache Beam tutorials that teach you the Beam API and how to write your own pipelines, and those will be useful for rewriting your Mapreduce code in Beam.

How to make inference on local PC with the model trained on AWS SageMaker by using the built-in algorithm Semantic Segmentation?

Similar to the issue: can the trained model be deployed on another platform without a dependency on SageMaker or AWS services?
I have trained a model on AWS SageMaker using the built-in Semantic Segmentation algorithm. The trained model, named model.tar.gz, is stored on S3. I want to download this file from S3 and then use it for inference on my local PC without using AWS SageMaker anymore. Since the built-in Semantic Segmentation algorithm is built using the MXNet Gluon framework and the GluonCV toolkit, I refer to the documentation of MXNet and GluonCV to make inference on my local PC.
It's easy to download this file from S3, and then I unzip this file to get three files:
hyperparams.json: includes the parameters for network architecture, data inputs, and training. Refer to Semantic Segmentation Hyperparameters.
model_algo-1
model_best.params
Both model_algo-1 and model_best.params are the trained model parameters; I think they are the output of net.save_parameters (refer to Train the neural network). I can also load them with the function mxnet.ndarray.load.
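For reference, the download-and-unpack step might look like the following; the bucket and key are placeholders for wherever the SageMaker training job wrote its model artifact:
import tarfile
import boto3

s3 = boto3.client("s3")
# placeholder bucket/key for the training job's output artifact
s3.download_file("<your-bucket>", "<training-job-prefix>/output/model.tar.gz", "model.tar.gz")

with tarfile.open("model.tar.gz") as tar:
    tar.extractall("./model")   # yields hyperparams.json, model_algo-1, model_best.params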
Refer to Predict with a pre-trained model. I found there are two necessary things:
Reconstruct the network for making inference.
Load the trained parameters.
As for reconstructing the network for making inference: since I used PSPNet for training, I can use the class gluoncv.model_zoo.PSPNet to reconstruct the network. I know how to use some AWS SageMaker services, for example batch transform jobs, to make inference, and I want to reproduce that on my local PC. But if I use the class gluoncv.model_zoo.PSPNet to reconstruct the network, I can't be sure the parameters of this network are the same as those used on AWS SageMaker during inference, because I can't inspect the image 501404015308.dkr.ecr.ap-northeast-1.amazonaws.com/semantic-segmentation:latest in detail.
As for loading the trained parameters, I can use load_parameters. But between model_algo-1 and model_best.params, I don't know which one I should use.
The following code works well for me.
import mxnet as mx
from mxnet import image
from gluoncv.data.transforms.presets.segmentation import test_transform
import gluoncv
# use cpu
ctx = mx.cpu(0)
# load test image
img = image.imread('./img/IMG_4015.jpg')
img = test_transform(img, ctx)
img = img.astype('float32')
# reconstruct the PSP network model
model = gluoncv.model_zoo.PSPNet(2)
# load the trained model
model.load_parameters('./model/model_algo-1')
# make inference
output = model.predict(img)
predict = mx.nd.squeeze(mx.nd.argmax(output, 1)).asnumpy()

Brewing up custom ML models on AWS SageMaker

I am new to SageMaker and I am trying to use my own scikit-learn algorithm. For this I use Docker.
I am trying to do the same task as described in this notebook on GitHub: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb
My question is: should I manually create the directory /opt/ml (I work on Windows)?
Can you explain, please?
Thank you.
You don't need to create /opt/ml; SageMaker will create it for you when it launches your training job.
The contents of the /opt/ml directory are determined by the parameters you pass to the CreateTrainingJob API call. The scikit example notebook you linked to describes this (look at the Running your container sections). You can find more info about this in the Create a Training Job section of the main SageMaker documentation.
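Inside the running container, your training code simply reads from and writes to the standard /opt/ml paths that SageMaker populates from the CreateTrainingJob parameters. A minimal hedged sketch of what a train script might do; the "training" channel name is whatever you configure in the job:
import json
import os

# hyperparameters passed to CreateTrainingJob arrive here as a JSON file
with open("/opt/ml/input/config/hyperparameters.json") as f:
    hyperparameters = json.load(f)

# each input channel is mounted under /opt/ml/input/data/<channel name>
train_dir = "/opt/ml/input/data/training"
train_files = os.listdir(train_dir)

# anything written to /opt/ml/model is packaged as model.tar.gz and uploaded to S3
model_dir = "/opt/ml/model"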

Continuous Training in Sagemaker

I am trying out Amazon SageMaker, and I haven't figured out how we can do continuous training.
For example, if I have a CSV file in S3, I want to train each time the CSV file is updated.
I know we can go back to the notebook and re-run the whole notebook to make this happen.
But I am looking for an automated way, with some Python scripts or using a Lambda function with S3 events, etc.
You can use the boto3 SDK for Python to start training from Lambda; then you need to trigger the Lambda when the CSV is updated.
http://boto3.readthedocs.io/en/latest/reference/services/sagemaker.html
Example Python code:
https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model-create-training-job.html
Addition: you don't need to use Lambda; you can just run/cron the Python script on any kind of instance that has Python and the AWS SDK on it.
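A minimal hedged sketch of such a Lambda handler, triggered by an S3 event for the updated CSV; the image URI, role ARN, bucket names, and instance type are placeholders you would replace with your own:
import time
import boto3

sm = boto3.client("sagemaker")

def lambda_handler(event, context):
    # the S3 event tells us which object was updated
    record = event["Records"][0]["s3"]
    input_uri = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    sm.create_training_job(
        TrainingJobName=f"retrain-{int(time.time())}",
        AlgorithmSpecification={
            "TrainingImage": "<algorithm or framework image URI>",
            "TrainingInputMode": "File",
        },
        RoleArn="<SageMaker execution role ARN>",
        InputDataConfig=[{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        OutputDataConfig={"S3OutputPath": "s3://<your-bucket>/output/"},
        ResourceConfig={
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )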
There are a couple examples for how to accomplish this in the aws-samples GitHub.
The serverless-sagemaker-orchestration example sounds most similar to the use case you are describing. This example walks you through how to continuously train a SageMaker linear regression model for housing price predictions on new CSV data that is added daily to an S3 bucket using the built-in LinearLearner algorithm, orchestrated with Amazon CloudWatch Events, AWS Step Functions, and AWS Lambda.
There is also the similar aws-sagemaker-build example but it might be more difficult to follow currently if you are looking for detailed instructions.
Hope this helps!
