Amazon SageMaker ScriptMode Long Python Wheel Build Times for CUDA Components - amazon-sagemaker

I use PyTorch estimator with SageMaker to train/fine-tune my Graph Neural Net on multi-GPU machines.
The requirements.txt that gets installed into the Estimator container has lines like:
torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-cluster -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-spline-conv -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
When SageMaker installs these requirements in the Estimator on the endpoint, it takes ~2 hrs to build the wheel. It takes only seconds on a local Linux box.
SageMaker Estimator:
PyTorch v1.10
CUDA 11.x
Python 3.8
Instance: ml.p3.16xlarge
I have noticed the same issue with other wheel-based components that require CUDA.
I have also tried building a Docker container on a p3.16xlarge and running that on SageMaker, but it was unable to recognize the instance GPUs.
Anything I can do to cut down these build times?

Installing these packages with pip requires [compiling][1] them, which takes time. I'm not sure, but on your local machine the wheels may have been built on the first install and reused afterwards. One workaround is to extend the base [container][2] with the lines below (a one-time cost) and use the resulting image in the SageMaker Estimator:
ADD ./requirements.txt /tmp/packages/
RUN python -m pip install --no-cache-dir -r /tmp/packages/requirements.txt
[1]: https://github.com/rusty1s/pytorch_scatter/blob/master/setup.py
[2]: https://github.com/aws/deep-learning-containers/blob/master/pytorch/training/docker/1.10/py3/cu113/Dockerfile.sagemaker.gpu
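If pip falls back to a source build, one thing worth ruling out is a mismatch between the -f index in requirements.txt and the torch/CUDA build inside the container. That is an assumption on my part; nothing in this thread confirms it as the cause. A minimal sketch that derives the index URL from the installed torch, following the URL pattern in the question (the helper names are made up for illustration, and the "cpu" tag for CPU-only builds is also an assumption):

```python
from typing import Optional

# Host and URL pattern copied from the requirements.txt in the question.
PYG_WHEEL_ROOT = "https://data.pyg.org/whl"


def pyg_wheel_index(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build the -f (find-links) URL for a given torch / CUDA pair."""
    # Drop any local build suffix: "1.10.0+cu113" -> "1.10.0"
    base = torch_version.split("+")[0]
    # torch.version.cuda is None on CPU-only builds; "cpu" tag is assumed.
    tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"{PYG_WHEEL_ROOT}/torch-{base}+{tag}.html"


def index_for_installed_torch() -> str:
    """Same as above, but reads the versions from the installed torch."""
    import torch  # imported lazily so pyg_wheel_index works without torch

    return pyg_wheel_index(torch.__version__, torch.version.cuda)
```

Running index_for_installed_torch() inside the training container and comparing the result with the URL in requirements.txt shows quickly whether the two agree.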

The solution is to augment the stock estimator image with the right components and then it can be run in the SageMaker script mode:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10-gpu-py38
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
The key is to make sure nvidia runtime is used at build time, so daemon.json needs to be configured accordingly:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
This is still not a complete solution, because viability of the build for SageMaker depends on the host where the build is performed.
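Once the extended image is pushed to your own ECR repository, the estimator can be pointed at it with image_uri. A rough sketch, assuming the SageMaker Python SDK v2; the account ID, repository name, role ARN, entry point and source directory below are placeholders, not values from this thread:

```python
def custom_image_estimator_kwargs(image_uri, role,
                                  entry_point="train.py", source_dir="src"):
    """Collect the arguments for a PyTorch estimator that uses a custom image.

    entry_point and source_dir are hypothetical defaults; adjust to your project.
    """
    return {
        "entry_point": entry_point,
        "source_dir": source_dir,
        "image_uri": image_uri,          # the image built from the Dockerfile above
        "framework_version": "1.10",     # matches the base image in the question
        "py_version": "py38",
        "role": role,
        "instance_count": 1,
        "instance_type": "ml.p3.16xlarge",
    }


def make_estimator(image_uri, role):
    """Create the estimator; needs the sagemaker package and AWS credentials."""
    from sagemaker.pytorch import PyTorch  # lazy import, SDK v2 assumed

    return PyTorch(**custom_image_estimator_kwargs(image_uri, role))


if __name__ == "__main__":
    # Placeholder URI and role, for illustration only.
    kwargs = custom_image_estimator_kwargs(
        "123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-training-pyg:latest",
        "arn:aws:iam::123456789012:role/MySageMakerRole",
    )
    print(kwargs["image_uri"])
```

Since requirements.txt is already baked into the image, it should no longer be needed in source_dir, so nothing gets rebuilt when the training job starts.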

Related

How do you run pyflink scripts on AWS EMR?

I am struggling to run the basic word_count.py PyFlink example that ships with Apache Flink on AWS EMR.
Steps taken:
Successfully created AWS EMR 6.5.0 cluster with the following applications [Flink, Zookeeper] - verified that there is a flink and flink-yarn-session binary in $PATH. AWS says it installed v1.14.
Ran the java version successfully by doing the following
sudo flink-yarn-session
sudo flink run -m yarn-cluster -yid <application_id> /usr/lib/flink/examples/batch/WordCount.jar
Tried running the same with the Python example, but no dice
sudo flink run -m yarn-cluster -yid <application_id> -py /usr/lib/flink/examples/python/table/word_count.py
This fails, but the error makes it obvious that it's picking up python2.7 even though python3 is the default!
Fixed the issue by somewhat following this link. Then tried a simple example that prints sys.version. This confirmed that it's picking up my Python version
Try again with venv
sudo flink run -m yarn-cluster -yid <application_id> -pyarch file:///home/hadoop/venv.zip -pyclientexec venv.zip/venv/bin/python3 -py /usr/lib/flink/examples/python/table/word_count.py
At this point, I start seeing various issues ranging from no file found to mysterious
pyflink.util.exceptions.TableException: org.apache.flink.table.api.TableException: Failed to execute sql
I ran various permutations with and without the YARN cluster, but have made no progress so far.
I am thinking my issues are either environment related (why isn't AWS taking care of proper python version is beyond me) or my inexperience with yarn/pyflink.
Any pointer would be greatly appreciated.
This is what you do. To make a cluster:
aws emr create-cluster --release-label emr-6.5.0 --applications Name=Flink --configurations file://./config.json --region us-west-2 --log-uri s3://SOMEBUCKET --instance-type m5.xlarge --instance-count 2 --service-role EMR_DefaultRole --ec2-attributes KeyName=YOURKEYNAME,InstanceProfile=EMR_EC2_DefaultRole --steps Type=CUSTOM_JAR,Jar=command-runner.jar,Name=Flink_Long_Running_Session,Args=flink-yarn-session,-d
Contents of config.json:
[
  {
    "Classification": "flink-conf",
    "Properties": {
      "python.executable": "python3",
      "python.client.executable": "python3"
    },
    "Configurations": []
  }
]
Then once you are in, try this
sudo flink run -m yarn-cluster -yid YID -py /usr/lib/flink/examples/python/table/batch/word_count.py
You can find the YID in the AWS EMR console under application user interfaces.
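To check that the python3 setting took effect before running word_count.py, a throwaway script can be submitted with -py instead. This is only a sketch: the file name is up to you, and as far as I can tell it reports the client-side interpreter (the one python.client.executable controls), not the one used by the workers.

```python
import sys


def interpreter_report():
    """Describe the Python interpreter that is running this script."""
    return {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        "major": sys.version_info.major,
    }


if __name__ == "__main__":
    report = interpreter_report()
    print(report)
    if report["major"] < 3:
        # Python 2 was picked up; the flink-conf settings did not apply.
        sys.exit("Python 2 detected; check python.client.executable")
```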

Glitches installing wxpython?

I'm relatively new to the Python world.
I'm trying to install wxpython on several computers and it keeps failing.
I use anaconda version 4.9.2 and use the prompt command:
conda install -c anaconda wxpython
I get the following error message:
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
I tried updating Python to the latest version and a number of other things, but I still get this pesky problem. What am I doing wrong?
Thanks!
Nothing wrong per se. Those messages are indicating that Conda can't install that package without changing the currently installed packages. Because the Anaconda distribution (!= Conda) has lots of packages this happens very frequently. Also, this particular package is not updated frequently and the anaconda channel doesn't even seem to keep pace with that.
In general, it is better practice to create a new environment for each project or task you work on, and to install only the packages you require. Also, the conda-forge channel tends to be a more consistent provider of packages, but undergoes less interoperability testing and tuning than the Anaconda channel packages. That is, consider trying something like
conda create -n myenv -c conda-forge python=3.9 wxpython ...
where myenv is whatever you would like to refer to the environment as, and ... should be whatever other packages you know you would like to use.
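After activating the new environment, a quick way to confirm the install worked is to ask Python whether wx can be found. A small sketch (wx_version is just a helper name I made up):

```python
import importlib.util
from typing import Optional


def wx_version() -> Optional[str]:
    """Return the wxPython version string, or None if wx is not installed."""
    if importlib.util.find_spec("wx") is None:
        return None
    import wx  # only imported when the package is present

    return wx.version()


if __name__ == "__main__":
    found = wx_version()
    print("wxPython not found in this environment" if found is None else found)
```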

Matlab integration - Run and test Matlab VOLTTRON Integration - pyzmq error+ volltron/config. path

Below are the steps I followed to integrate a fake building - fake modbus device (Ubuntu 16.04 LTS) with a Matlab-based interface.
Following the documentation steps at: http://volttron.readthedocs.io/en/4.1/devguides/walkthroughs/DrivenMatlabAgent-Walkthrough.html
Installation steps for system running Matlab:
Install python (my Python versions: 3.6.3 and 2.7.12)
Install pyzmq following the steps at (https://github.com/zeromq/pyzmq): I use pip install pyzmq
I get
Requirement already satisfied: pyzmq in ./env/local/lib/python2.7/site-packages
Steps for system running Matlab:
Install python – done
Install pyzmq –done
Install Matlab-- done (R2017b)
run pyversion --done
version: '2.7'
executable: '/home/USER_NAME/volttron/env/bin/python'
library: 'libpython2.7.so.1.0'
home: '/home/USER_NAME/volttron/env'
isloaded: 0
when I run py.zmq.pyzmq_version() I get
ans =
Python str with no properties.
15.4.0
I copied the example.m to the desktop.
Run and test Matlab VOLTTRON Integration:
To run and test the integration:
Assumptions
Device driver agent is already developed (master_driveragent-3.1.1- is installed)
Installation:
Install VOLTTRON –done
Add subtree volttron-applications under volttron/applications by using the following command –
For adding subtree: I used the code:
git subtree add --prefix applications https://github.com/VOLTTRON/volttron-applications.git develop --squash
error
(Working tree has modifications. Cannot add.)
Configuration
Copy example configuration file applications/pnnl/DrivenMatlabAgent/config_waterheater to volltron/config. (I could not find a path called config?)
Questions
Is there any issue with pyzmq?
I ran the subtree command in the volttron root. Why does it refuse to add the subtree?
What is the volltron/config. path?
Thanks,
Looks like you have local changes in your cloned volttron directory. Please stash or commit those changes before adding the subtree.
If the config folder does not exist, you can create it (I will make a note of it in the documentation as well). It is only a location to copy the config file to so you can make changes (config_url and data_url).
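To see which files are blocking the subtree command, you can list the tracked files that git reports as modified. A minimal sketch, assuming git is on the PATH; as far as I understand, untracked files do not trigger the "Working tree has modifications" error, so they are skipped here. The function names are my own.

```python
import subprocess


def tracked_changes(porcelain_output):
    """Return paths with staged or unstaged changes from `git status --porcelain`.

    Untracked entries (prefixed with '??') are ignored.
    """
    changed = []
    for line in porcelain_output.splitlines():
        if not line or line.startswith("??"):
            continue
        # Porcelain v1 lines are two status characters, a space, then the path.
        changed.append(line[3:])
    return changed


def repo_tracked_changes(repo_dir):
    """Run git in repo_dir and return the tracked files that have changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return tracked_changes(result.stdout)
```

If repo_tracked_changes returns an empty list for your volttron directory, the working tree should be clean enough to retry the subtree command.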

How to install MEAN stack on cent os

I am new to the stack. I have been trying to learn the MEAN stack but am unable to install it on CentOS 7. I have a system with the following configuration:
1. i7 Processor
2. 12 GB RAM
3. 512 GB Hard Disk
You will have to add the Node.js yum repository to the system by running the following commands in succession:
# yum install -y gcc-c++ make
# curl -sL https://rpm.nodesource.com/setup_6.x | sudo -E bash -
Then install nodejs using
# yum install nodejs
This will also install npm for you. Now you're good to go for using Node.js and Express (the N and E of the MEAN stack).
You also need to set up MongoDB if that is what you want to use as the database; the MEAN stack normally uses MongoDB for that. Please follow their official documentation. That takes care of the M of the MEAN stack.
Now for the A part you need Angular. I would suggest you look at the available versions of Angular and choose how to set it up based on that; they have pretty good documentation. You can choose from AngularJS, Angular 2, or Angular 4.
Hope this helps.

What are dronekit-python dependencies?

The dronekit Getting Started page suggests installing WinPython to use dronekit-Python on Windows because it includes the dependencies. I already have a working Python installation and I prefer not to risk messing it up with WinPython. What are the dependencies I need to install?
As of DKPY 2.0 this is outdated. Also, I might move to making a MAVProxy module, depending on whether or not the unpaid devs decide to stay when 3DR stops funding Dronekit.
I've written a procedure to help with this problem, which I've pasted below. 3DR claims they're going to fix it, but in the meantime I hope this will help.
This setup is for Windows 64-bit systems only, although similar procedures will work with 32-bit.
1. Install MAVProxy and run it once before reaching step 5.
2. Install Notepad++.
3. Install Python v2.7.
4. Inside the Python folder, run WinPython Control Panel and select Advanced->Register Python.
5. Inside the same folder, run WinPython Command Prompt and input the following four commands:
• pip uninstall python-dateutil
• pip install droneapi
• pip install console
• echo module load droneapi.module.api >> %HOMEPATH%\AppData\Local\MAVProxy\mavinit.scr
6. Install WX Python. It should be the 64-bit Python 2.7 version.
7. Download and install OpenCV 2.4.11 to any folder
• Copy/paste the file cv2.pyd from OpenCV\build\python\2.7\x64\ to \python-2.7.6.amd64\Lib\site-packages.
Steps 8 through 11 apply to SITL only
8. Follow the online documentation for setting up Cygwin for SITL in Windows
9. Go to C:\cygwin\home\Your Username\ardupilot\Tools\autotest\
10. Open sim_vehicle.sh in Notepad++
• Change line 429 from…
cygstart -w "/cygdrive/c/Program Files (x86)/MAVProxy/mavproxy.exe" $options --cmd="$extra_cmd" $*
to...
cygstart -w "/cygdrive/c/Users/YOUR USERNAME HERE/Desktop/WinPython-64bit-2.7.6.4/python-2.7.6.amd64/Dronekit/Scripts/mavproxy.py" $options --cmd="$extra_cmd" $*
Note: This location changes depending on where you installed WinPython. For me, it was the desktop.
11. Start simulations as you would normally for SITL. To run Python scripts during the simulations, use the command
• api start Path to script\script_name
12. To use the code to connect to an actual copter, open WinPython Command Prompt
• Navigate to the folder which contains the scripts you wish to test
• Type mavproxy.py --master="com##",57600
• Run your script by typing into the MAVProxy terminal
o api start script_name
