SageMaker Estimator fit job never ends - amazon-sagemaker

I have the following code:
import sagemaker
from sagemaker.estimator import Estimator

sess = sagemaker.Session()  # session used for the default bucket

estimator = Estimator(
    image_uri=ecr_image,
    role=role,
    instance_count=1,
    instance_type=instance_type,
    hyperparameters=hyperparameters,
)
estimator.fit({"training": "s3://" + sess.default_bucket() + "/" + prefix})
which seems to run smoothly until it gets stuck at:
Finished Training
2020-12-02 15:00:45,352 sagemaker-training-toolkit INFO Reporting training SUCCESS
and I see an InProgress job in the AWS SageMaker console. How can I fix this?
I use the 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-eia:1.3.1-cpu-py36-ubuntu16.04 Docker image with pip install sagemaker-training added.
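For reference, while the console shows InProgress, the job's server-side status can be polled directly; a minimal sketch using boto3, with the job name assumed (substitute the one shown in the console):

import boto3

# Hypothetical job name; use the one listed in the SageMaker console.
sm = boto3.client("sagemaker")
desc = sm.describe_training_job(TrainingJobName="my-training-job")

# TrainingJobStatus stays "InProgress" until SageMaker tears the container down;
# SecondaryStatus gives the finer-grained phase (Training, Uploading, etc.).
print(desc["TrainingJobStatus"], desc["SecondaryStatus"])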

Related

Setting up a Flink cluster with Podman for a Beam pipeline with FlinkRunner

My goal is to create a streaming pipeline that reads data from Apache Kafka, processes the data, and writes it back to Kafka.
For security reasons, I want to avoid Docker and use Podman.
I have set up a minimal cluster via a docker-compose.yml with a jobmanager, a taskmanager and a Python SDK harness worker. The SDK harness worker seems to get stuck when I try to execute a pipeline.
When I run the pipeline (reading a multi-line .txt file and writing it back to a file), it is transferred to the jobmanager and taskmanager correctly, but then goes idle. When I look into the pythonsdk container, the logs show the following message repeatedly:
2022/12/04 16:13:02 Starting worker pool 1: python -m
apache_beam.runners.worker.worker_pool_main --service_port=50000
--container_executable=/opt/apache/beam/boot
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1',
'--logging_endpoint=localhost:45087',
'--artifact_endpoint=localhost:35323',
'--provision_endpoint=localhost:36435',
'--control_endpoint=localhost:33237']
2022/12/04 16:16:31 Failed to obtain provisioning information: failed to
dial server at localhost:36435
caused by:
context deadline exceeded
Here is a link to a test pipeline that was created:
Example on github
Environment:
Debian 11;
Podman;
Python 3.2.9;
apache-beam==2.38.0; and
podman-compose
The setup of the cluster is defined in docker-compose.yml:
1x flink-jobmanager (flink version 1.14)
1x flink-taskmanager
1x Python Harness SDK
I chose to create an SDK container manually because I don't have Docker installed, and Flink fails when it tries to create a container via Docker.
I suspect that I have made a mistake in the network setup, or that some configuration is missing for the harness worker, but I could not figure out the problem. Any thoughts?
Crossposted to the Beam user mailing list at beam.apache.org.
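For context, the "failed to dial server at localhost:36435" error suggests the harness boot binary is being handed localhost endpoints that are only valid inside another container's network namespace. A minimal sketch of a pipeline submission that targets an external worker pool instead; the job-server port and the "pythonsdk" hostname are assumptions about this particular setup:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Assumed endpoints: a Beam job server on localhost:8099 and the worker-pool
# container reachable as "pythonsdk" on the shared Podman network.
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=pythonsdk:50000",  # must be reachable from the taskmanager, not localhost
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Read" >> beam.io.ReadFromText("input.txt")
     | "Write" >> beam.io.WriteToText("output"))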

Amazon SageMaker ScriptMode Long Python Wheel Build Times for CUDA Components

I use the PyTorch estimator with SageMaker to train/fine-tune my Graph Neural Net on multi-GPU machines.
The requirements.txt that gets installed into the Estimator container has lines like:
torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-cluster -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-spline-conv -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
When SageMaker installs these requirements in the Estimator, it takes ~2 hours to build the wheels. It takes only seconds on a local Linux box.
SageMaker Estimator:
PyTorch v1.10
CUDA 11.x
Python 3.8
Instance: ml.p3.16xlarge
I have noticed the same issue with other wheel-based components that require CUDA.
I have also tried building a Docker container on a p3.16xlarge and running that on SageMaker, but it was unable to recognize the instance's GPUs.
Is there anything I can do to cut down these build times?
pip install for these packages requires [compilation][1], which takes time. It is not certain, but on your local machine the wheel may have been built (and cached) the first time. One workaround is to extend the base [container][2] with the lines below (a one-time cost) and use the resulting image in the SageMaker Estimator:
ADD ./requirements.txt /tmp/packages/
RUN python -m pip install --no-cache-dir -r /tmp/packages/requirements.txt
[1]: https://github.com/rusty1s/pytorch_scatter/blob/master/setup.py
[2]: https://github.com/aws/deep-learning-containers/blob/master/pytorch/training/docker/1.10/py3/cu113/Dockerfile.sagemaker.gpu
The solution is to augment the stock estimator image with the right components; then it can be run in SageMaker script mode:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10-gpu-py38
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
The key is to make sure the nvidia runtime is used at build time, so daemon.json needs to be configured accordingly:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
This is still not a complete solution, because the viability of the build for SageMaker depends on the host where the build is performed.
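To illustrate how the extended image would be used in script mode, here is a minimal sketch; the ECR URI, entry-point script name, and data channel are hypothetical placeholders for your own values:

from sagemaker.pytorch import PyTorch

# Hypothetical URI of the extended image pushed to your own ECR repository.
image_uri = "<account>.dkr.ecr.us-east-1.amazonaws.com/pytorch-training-extended:1.10-gpu-py38"

estimator = PyTorch(
    entry_point="train.py",        # your script-mode training script (assumed name)
    image_uri=image_uri,           # requirements are baked in, so no wheel builds at startup
    role=role,                     # an existing SageMaker execution role
    instance_count=1,
    instance_type="ml.p3.16xlarge",
)
estimator.fit({"training": "s3://my-bucket/train"})  # hypothetical data channel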

Deploying Haskell yesod docker container on google app engine

I am trying to deploy a Yesod Docker container on Google App Engine. The source code is here and the Docker image is here.
I followed the documentation in the Custom runtime quickstart. When invoking gcloud app deploy, the app builds fine after increasing the build timeout, but the container either fails the readiness check when trying to start or shows the following timeout message:
ERROR: (gcloud.app.deploy) Operation [apps/meeshkan-github-webhook-router/operations/xxxx-xxxx-xxxx] timed out. This operation may still be underway.
I have tried experimenting with several things, including a manual readiness check, creating an /_ah/health endpoint, and increasing the timeout of the readiness check all the way to 1799 seconds, but none of these actions seem to work.
One issue may be the size of the container (it is 3.2 GB), and I could try to prune it down, but I'd only do that if someone could confirm that container size is a contributing factor to deployment problems. Other than that, I'm not sure what could be causing this failure. The Docker image starts fine on our local machines.
Thanks in advance for your help and suggestions!
The issue turned out to be that I was building on Windows: Docker Desktop on Windows gives all shell scripts executable permission automatically, whereas Docker on Linux needs shell scripts to be given the executable permission explicitly. By adding this line to my Dockerfile:
RUN chmod +x /usr/src/app/run.sh
Everything worked fine!

IBM Cloud Private-Community Edition - Waiting for cloudant database initialization

I tried the command below:
docker run --rm -t -e LICENSE=accept --net=host -v "$(pwd)":/installer/cluster ibmcom/icp-inception:2.1.0 install
and the response is:
Waiting for cloudant initialization
I entered the command and received the logs shown in the image; no error is shown. Please suggest a solution.
From the message, the Cloudant database initialization issue may be caused by the Cloudant Docker image still being pulled from Docker Hub during the ICP installation. The Cloudant Docker image is big; you can run the command below to check whether the image is already present in your environment.
$ docker images | grep icp-datastore
If the Cloudant Docker image is ready in your environment and the ICP installation still hits the Cloudant database initialization issue, you can try installing the latest ICP 2.1.0.3 Community Edition. As of 2.1.0.3, ICP no longer uses the Cloudant database. The ICP 2.1.0.3 installation documentation:
https://www.ibm.com/support/knowledgecenter/en/SSBS6K_2.1.0.3/installing/install_containers_CE.html
If you still want to debug the Cloudant database initialization issue in your ICP 2.1.0.1 environment, you can:
First, ensure your ICP nodes meet the system and hardware requirements:
https://www.ibm.com/support/knowledgecenter/en/SSBS6K_2.1.0/supported_system_config/system_reqs.html
Share the ICP installation configuration; check the contents of the config.yaml and hosts files.
Check the system logs (in the /var/log/messages or /var/log/syslog file) to find the relevant errors.
Run the 'docker logs <container>' command to check the logs for errors.

exec: "gcc": executable file not found in $PATH

I'm having an issue where an App Engine project will no longer build remotely (via gcloud app deploy).
This has started out of the blue, with no code changes at this end. Not sure if it's relevant, but it's a Go 1.9 project deploying to the App Engine Flex environment.
I'm not sure how to test this in the same environment as the build, since the error is coming from Google's Container Registry.
Here is the log from the Container Registry console:
starting build "73f85b4d-7370-41bd-bbb2-bcf42fc38873"
FETCHSOURCE
Fetching storage object: gs://staging.[project].appspot.com/us.gcr.io/[project]/appengine/default.1ed3c690ead06f27aa651a30fab342611:latest#1531698266413753
Copying gs://staging.[project].appspot.com/us.gcr.io/[project]/appengine/default.1ed3c690ead49f731806f27aa630fab342611:latest#1531698266413753...
Operation completed over 1 objects/1.7 MiB.
BUILD
Starting Step #0
Step #0: Pulling image: gcr.io/gcp-runtimes/go1-builder@sha256:c62ac3fbec31ddec70601d6c5b44d07063bcff6a823bdcf5e0bbaa9d3799d1db
Step #0: sha256:c62ac3fbec31ddec70601d6c5b44d07063bcff6a823bdcf5e0bbaa9d3799d1db: Pulling from gcp-runtimes/go1-builder
Step #0: Digest: sha256:c62ac3fbec31ddec70601d6c5b44d07063bcff6a823bdcf5e0bbaa9d3799d1db
Step #0: Status: Downloaded newer image for gcr.io/gcp-runtimes/go1-builder@sha256:c62ac3fbec31ddec70601d6c5b44d07063bcff6a823bdcf5e0bbaa9d3799d1db
Step #0: exec: "gcc": executable file not found in $PATH
Finished Step #0
ERROR
ERROR: build step 0 "gcr.io/gcp-runtimes/go1-builder@sha256:c62ac3fbec31ddec70601d6c5b44d07063bcff6a823bdcf5e0bbaa9d3799d1db" failed: exit status 2
It looks like you are using the container gcr.io/gcp-runtimes/go1-builder as your build step. Looking at the source on GitHub, I see that there have been no updates since ~late June. The Dockerfile's FROM directive uses gcr.io/google-appengine/debian9:latest as the base image, and a look at that image reveals no gcc installed. I see no step in the Dockerfile installing gcc, and running your build-step image confirms that it isn't there:
~$ docker run --rm -t -i --entrypoint /bin/bash gcr.io/gcp-runtimes/go1-builder@sha256:c62ac3fbec31ddec70601d6c5b44d07063bcff6a823bdcf5e0bbaa9d3799d1db -- which gcc
Unable to find image 'gcr.io/gcp-runtimes/go1-builder@sha256:c62ac3fbec31ddec70601d6c5b44d07063bcff6a823bdcf5e0bbaa9d3799d1db' locally
sha256:c62ac3fbec31ddec70601d6c5b44d07063bcff6a823bdcf5e0bbaa9d3799d1db: Pulling from gcp-runtimes/go1-builder
e154cec6816f: Pull complete
<pulls elided>
Digest: sha256:c62ac3fbec31ddec70601d6c5b44d07063bcff6a823bdcf5e0bbaa9d3799d1db
Status: Downloaded newer image for gcr.io/gcp-runtimes/go1-builder@sha256:c62ac3fbec31ddec70601d6c5b44d07063bcff6a823bdcf5e0bbaa9d3799d1db
~$
Perhaps an earlier version of the base debian9 image had it installed; you could dig into the history to check. But it looks like there has been no recent change to the go1-builder image that removed gcc.
If you need gcc, you can always separate building your app from deploying it. Build with your own cloudbuild.yaml via gcloud container builds submit, and then deploy the built container using gcloud app deploy --image-url=... With full control over the build, you can base your image on the go1-builder image and install any additional tooling you need, such as gcc, before using Docker to build your final app container.
