fastai distributed training in SageMaker? - amazon-sagemaker

For anyone with experience with fastai’s distributed training (either within SageMaker or outside it):
Are there any material benefits to using it over PyTorch DDP (which it’s built on top of)?
What would be the easiest way to incorporate this inside a SM training job?
It requires the training script to be launched as python -m fastai.launch scriptname.py ...args..., so using Script Mode is not straightforward. Pointing entry_point in the PyTorch estimator at a .sh file means that SageMaker will not pip install from a provided requirements.txt, so the user must handle dependency installation inside the bash script.
Can SM Distributed Data Parallel be used with fastai distributed training? Or do we need to use PyTorch DDP directly instead of fastai in order to use SM DDP?
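One possible workaround, sketched here and not verified inside a SageMaker job: keep a plain Python file as the entry_point so that Script Mode still installs requirements.txt, and have that file start the fastai launcher as a subprocess. The script name train_fastai.py and the helper names are hypothetical; python -m fastai.launch is the invocation quoted in the question.

```python
import subprocess
import sys


def build_launch_command(script, script_args, python=sys.executable):
    """Return the argv list for: python -m fastai.launch <script> <args...>."""
    return [python, "-m", "fastai.launch", script, *script_args]


def launch(script, script_args):
    """Run the fastai launcher and return its exit code.

    SageMaker passes hyperparameters to the entry point as command-line
    arguments, so they are forwarded unchanged to the real training script.
    """
    command = build_launch_command(script, script_args)
    return subprocess.call(command)


# Show the command that would be executed (nothing is launched here).
# "train_fastai.py" is a hypothetical script name.
print(build_launch_command("train_fastai.py", ["--epochs", "3"], python="python"))
```

In the actual entry point, the last line would be something like sys.exit(launch("train_fastai.py", sys.argv[1:])). Whether this combines cleanly with SageMaker's own distributed launchers, including SM DDP, is a separate question that this sketch does not answer.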

Related

Amazon SageMaker multi GPU: No objective found

I have a question on SageMaker multi-GPU training. I have a customer running their code on single-GPU instances (ml.p3.2xlarge), but when they select ml.p3.8xlarge (multi-GPU), they run into the following error:
“Failure reason: No objective metrics found after running 5 training jobs. Please ensure that the custom algorithm is emitting the objective metric as defined by the regular expression provided.”
Their code handles multi gpu usage and currently works well on their machine outside of AWS. Do you have any documentation that you can point me to help them address the problem? They are currently using PyTorch for all of their model development.
It looks like they are running Hyperparameter Optimization (HPO) on SageMaker and their code is not emitting any metric that HPO can use for tuning. The problem is in how they specify the regular expression for the objective metric; for more details see SageMaker Estimator Metrics Definitions.
Essentially, use a tool like https://regex101.com to validate that the regex they use extracts the objective number from their training logs.
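The same check can be done locally in a few lines of Python. This is a minimal sketch: the metric name, the regex, and the sample log lines are hypothetical, and the dictionary simply follows the Name/Regex shape used for metric definitions.

```python
import re

# Hypothetical metric definition: a name plus a regex with one capture
# group around the number that HPO should read.
metric_definitions = [
    {"Name": "validation:loss", "Regex": r"val_loss: ([0-9\.]+)"},
]

# Hypothetical lines as they might appear in the training job's logs.
sample_log_lines = [
    "epoch 1 train_loss: 0.9132 val_loss: 0.8451",
    "epoch 2 train_loss: 0.7010 val_loss: 0.6523",
]


def extract_metric(regex, log_lines):
    """Return every value captured by group 1 of `regex`, as floats."""
    values = []
    for line in log_lines:
        match = re.search(regex, line)
        if match:
            values.append(float(match.group(1)))
    return values


for definition in metric_definitions:
    values = extract_metric(definition["Regex"], sample_log_lines)
    if values:
        print(f"{definition['Name']}: {values}")
    else:
        print(f"{definition['Name']}: regex matched nothing")
```

If the regex matches nothing when run against log lines copied from the multi-GPU job, the tuning job will report that no objective metric was found.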

PyTorch Lightning with Amazon SageMaker

We’re currently using PyTorch Lightning for training outside of SageMaker. We are looking to use SageMaker to leverage distributed training, checkpointing, model training optimization (Training Compiler), etc. to accelerate the training process and save costs. What's the recommended way to migrate existing PyTorch Lightning scripts to run on SageMaker?
The easiest way to run PyTorch Lightning on SageMaker is to use the SageMaker PyTorch estimator (example) to get started. Ideally you will add a requirements.txt alongside your source code to install PyTorch Lightning.
Regarding distributed training, Amazon SageMaker recently launched native support for PyTorch Lightning based distributed training. Please follow the links below to set up your training code:
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt-lightning.html
https://aws.amazon.com/blogs/machine-learning/run-pytorch-lightning-and-native-pytorch-ddp-on-amazon-sagemaker-training-featuring-amazon-search/
There's no big difference between running PyTorch Lightning and plain PyTorch scripts with SageMaker.
One caveat, however, when running distributed training jobs with DDPPlugin: set the NODE_RANK environment variable properly at the beginning of the script, because PyTorch Lightning knows nothing about SageMaker environment variables and relies on generic cluster variables:
import os
os.environ["NODE_RANK"] = str(int(os.environ.get("CURRENT_HOST", "algo-1")[5:]) - 1)
or (more robust):
import json
import os
rc = json.loads(os.environ.get("SM_RESOURCE_CONFIG", "{}"))
os.environ["NODE_RANK"] = str(rc["hosts"].index(rc["current_host"]))
Since your question is specific to migrating already-working code to SageMaker, using the link here as reference, I can try to break the process into 3 parts:
Create a PyTorch Estimator - estimator
import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
pytorch_estimator = PyTorch(
    entry_point='my_model.py',
    instance_type='ml.g4dn.16xlarge',
    instance_count=1,
    framework_version='1.7',
    py_version='py3',
    role=sagemaker.get_execution_role(),  # assumes a SageMaker notebook; otherwise pass an IAM role ARN
    output_path='<< s3 bucket >>',  # placeholder: S3 URI for the job output
    source_dir='<< path for my_model.py >>',  # placeholder: directory containing my_model.py
    sagemaker_session=sagemaker_session)
entry_point = "my_model.py" - this should be your existing PyTorch Lightning script. In the main method you can have something like this:
if __name__ == '__main__':
    import pytorch_lightning as pl
    trainer = pl.Trainer(
        devices=-1,  # use all available GPUs
        accelerator="gpu",
        strategy="ddp",
        enable_checkpointing=True,
        default_root_dir="/opt/ml/checkpoints",
    )
    # ... then call trainer.fit(...) on your existing LightningModule
Launch the training job from the notebook (or wherever the estimator was created), not from inside my_model.py:
pytorch_estimator.fit()
Also, the link here explains the coding process very well:
https://vision.unipv.it/events/Bianchi_giu2021-Introduction-PyTorch-Lightning.pdf

Dynamic Job Creation and Submission to Flink

Hi, I am planning to use Flink as a backend for a feature where we will show the user a UI to graphically create event patterns, e.g. multiple login failures from the same IP address.
We will create the Flink pattern programmatically from the criteria the user gives in the UI.
Is there any documentation on how to dynamically create the jar file and dynamically submit the job with it to a Flink cluster?
Is there any best practice for this kind of use case with Apache Flink?
The other way you can achieve that is to have one jar that contains something like an “interpreter”, to which you pass the definition of your patterns in some format (e.g. JSON). The “interpreter” then translates this JSON into Flink operators. This is how it is done in Nussknacker's Flink-based execution engine (https://github.com/TouK/nussknacker/). If you use this approach, you will need to handle redeployment of new definitions in your own application.
One straightforward way to achieve this would be to generate a SQL script for each pattern (using MATCH_RECOGNIZE) and then use Ververica Platform's REST API to deploy and manage those scripts: https://docs.ververica.com/user_guide/application_operations/deployments/artifacts.html?highlight=sql#sql-script-artifacts
Flink doesn't provide tooling for automating the creation of JAR files, or for submitting them. That's the sort of thing you might use a CI/CD pipeline to do (e.g., GitHub Actions).
Disclaimer: I work for Ververica.
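To make the "generate a SQL script for each pattern" idea concrete, here is a minimal sketch that renders one UI-defined pattern as a MATCH_RECOGNIZE query. The pattern dictionary, its keys, and the table and column names are all hypothetical, and the generated SQL has not been validated against a Flink cluster; deploying the script is left to whatever API you use.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _identifier(name):
    """Reject anything that is not a plain SQL identifier."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def build_match_recognize_sql(pattern):
    """Render one pattern definition (a dict, e.g. parsed from JSON) as SQL."""
    table = _identifier(pattern["table"])
    key = _identifier(pattern["partition_by"])
    time_col = _identifier(pattern["order_by"])
    field = _identifier(pattern["field"])
    value = str(pattern["equals"]).replace("'", "''")  # escape single quotes
    count = int(pattern["count"])
    minutes = int(pattern["within_minutes"])
    return (
        "SELECT *\n"
        f"FROM {table}\n"
        "MATCH_RECOGNIZE (\n"
        f"  PARTITION BY {key}\n"
        f"  ORDER BY {time_col}\n"
        "  MEASURES\n"
        f"    FIRST(F.{time_col}) AS first_event_time,\n"
        f"    LAST(F.{time_col}) AS last_event_time\n"
        "  ONE ROW PER MATCH\n"
        "  AFTER MATCH SKIP PAST LAST ROW\n"
        f"  PATTERN (F{{{count}}}) WITHIN INTERVAL '{minutes}' MINUTE\n"
        "  DEFINE\n"
        f"    F AS F.{field} = '{value}'\n"
        ")"
    )


# Hypothetical pattern: 3 failed logins from the same IP within 5 minutes.
login_failures = {
    "table": "login_events",
    "partition_by": "ip",
    "order_by": "event_time",
    "field": "status",
    "equals": "FAILURE",
    "count": 3,
    "within_minutes": 5,
}
print(build_match_recognize_sql(login_failures))
```

Validating identifiers and escaping the literal matters here because the pattern values come from end users through the UI.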

Migrating from Google App Engine Mapreduce to Apache Beam

I have been a long-time user of Google App Engine's Mapreduce library for processing data in the Google Datastore. Google no longer supports it and it doesn't work at all in Python 3. I'm trying to migrate our older Mapreduce jobs to Google's Dataflow / Apache Beam runner, but the official documentation is awful: it just describes Apache Beam and does not tell you how to migrate.
In particular, the issues are these:
In Mapreduce, the jobs would run on your existing deployed application. However, in Beam you have to create and deploy a custom Docker image to build the environment for Dataflow. Is this right?
To create a new job template in Mapreduce, you just need to edit a yaml file and deploy it. To create one in Apache Beam, you need to create custom runner code, deploy a template file to Google Cloud Storage, and link it up with the Docker image. Is this right?
Is the above accurate? If so, is it generally the case that working with Dataflow is much more difficult than Mapreduce? Are there any libraries or tips for making this easier?
In technical terms that's what is happening, but unless you have some specific advanced use cases, you won't need to set up any custom Docker images manually. Dataflow does some work in the background to package your code and its dependencies into a container so that it can execute them on its VMs.
In Dataflow, writing a job template mainly requires writing some pipeline code in your chosen language (Java or Python), and possibly writing some metadata. Once your code is written, creating and staging the template itself isn't much different than running a normal Dataflow job. There's a page documenting the process.
I agree the page on Mapreduce to Beam migration is very sparse and unhelpful, although I think I understand why that is. Migrating from Mapreduce to Beam isn't a straightforward 1:1 migration where only the syntax changes. It's a different pipeline model and most likely will require some level of rewriting your code for the migration. A migration guide that fully covered everything would end up repeating most of the existing documentation.
Since it sounds like most of your questions are around setting up and executing Beam pipelines, I encourage you to begin with the Dataflow quickstart in your chosen language. It won't teach you how to write pipelines, but will teach you how to set up your environment to write and run pipelines. There are links in the quickstarts which direct you to Apache Beam tutorials that teach you the Beam API and how to write your own pipelines, and those will be useful for rewriting your Mapreduce code in Beam.
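To give a feel for how a map/reduce job maps onto Beam's model, here is a minimal sketch of a "count per key" job written with the Beam Python SDK. The entity shape and all names are hypothetical, and beam.Create stands in for a real Datastore source; run locally, it uses the default direct runner rather than Dataflow.

```python
def map_entity(entity):
    """The old "map" step: emit (key, 1) for each entity."""
    yield (entity["country"], 1)


def run_count_by_country(entities):
    """Count entities per country with a Beam pipeline on the local runner."""
    # Imported here so the helper above can be used and tested without Beam.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "CreateInput" >> beam.Create(entities)    # stand-in for a Datastore read
            | "MapToPairs" >> beam.FlatMap(map_entity)  # the "map" phase
            | "SumPerKey" >> beam.CombinePerKey(sum)    # the "reduce" phase
            | "Print" >> beam.Map(print)
        )


sample_entities = [{"country": "US"}, {"country": "DE"}, {"country": "US"}]
# run_count_by_country(sample_entities)  # requires: pip install apache-beam
```

The mapper function carries over almost unchanged; what changes is that the job is expressed as a chain of transforms instead of a yaml-configured handler.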

Continuous Training in Sagemaker

I am trying out Amazon SageMaker, but I haven't figured out how to set up continuous training.
For example, if I have a CSV file in S3, I want to train each time the CSV file is updated.
I know we can go back to the notebook and re-run the whole notebook to make this happen.
But I am looking for an automated way, with some Python scripts or a Lambda function triggered by S3 events, etc.
You can use the boto3 SDK for Python to start a training job from a Lambda function; you then need to trigger the Lambda when the CSV is updated.
http://boto3.readthedocs.io/en/latest/reference/services/sagemaker.html
Example Python code:
https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model-create-training-job.html
Addition: You don't need to use Lambda; you can just run the Python script (e.g. as a cron job) on any kind of instance that has Python and the AWS SDK installed.
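As a rough illustration of the Lambda route, the sketch below turns an S3 "object created" notification into a create_training_job call. The image URI, role ARN, output path, job-name prefix, and instance settings are placeholders or assumptions to be replaced with your own values; the S3 trigger itself is configured on the bucket, not in this code.

```python
import urllib.parse
from datetime import datetime, timezone

# Placeholders: replace with values from your own account.
TRAINING_IMAGE = "<< training image URI >>"
ROLE_ARN = "<< SageMaker execution role ARN >>"
OUTPUT_S3_URI = "<< s3 output path >>"


def build_training_request(event, now=None):
    """Turn an S3 "object created" event into create_training_job arguments."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["object"]["key"])
    now = now or datetime.now(timezone.utc)
    return {
        # Job names must be unique, so a timestamp is appended.
        "TrainingJobName": "csv-retrain-" + now.strftime("%Y%m%d-%H%M%S"),
        "AlgorithmSpecification": {
            "TrainingImage": TRAINING_IMAGE,
            "TrainingInputMode": "File",
        },
        "RoleArn": ROLE_ARN,
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "ContentType": "text/csv",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": f"s3://{bucket}/{key}",
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": OUTPUT_S3_URI},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


def lambda_handler(event, context):
    """Lambda entry point: start one training job per S3 notification."""
    import boto3  # available in the Lambda Python runtime

    request = build_training_request(event)
    boto3.client("sagemaker").create_training_job(**request)
    return {"TrainingJobName": request["TrainingJobName"]}
```

Keeping the request construction in its own function makes it possible to check the generated arguments locally before wiring up the trigger; the same function can be reused from a cron-driven script.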
There are a couple examples for how to accomplish this in the aws-samples GitHub.
The serverless-sagemaker-orchestration example sounds most similar to the use case you are describing. This example walks you through how to continuously train a SageMaker linear regression model for housing price predictions on new CSV data that is added daily to a S3 bucket using the built-in LinearLearner algorithm, orchestrated with Amazon CloudWatch Events, AWS Step Functions, and AWS Lambda.
There is also the similar aws-sagemaker-build example but it might be more difficult to follow currently if you are looking for detailed instructions.
Hope this helps!
