How do you run pyflink scripts on AWS EMR? - apache-flink

I am struggling to run the basic word_count.py PyFlink example that ships with Apache Flink on AWS EMR.
Steps taken:
Successfully created an AWS EMR 6.5.0 cluster with the following applications [Flink, Zookeeper] - verified that there are flink and flink-yarn-session binaries in $PATH. AWS says it installed Flink 1.14.
Ran the Java version successfully by doing the following:
sudo flink-yarn-session
sudo flink run -m yarn-cluster -yid <application_id> /usr/lib/flink/examples/batch/WordCount.jar
Tried running the same with Python, but no dice:
sudo flink run -m yarn-cluster -yid <application_id> -py /usr/lib/flink/examples/python/table/word_count.py
This fails, but the error makes it obvious that it is picking up Python 2.7 even though Python 3 is the default.
Fixed the issue by loosely following this link. Then tried a simple example that prints out sys.version. This confirmed that it is picking up my Python version.
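A sanity check of that kind can be as small as the script below (the file name check_version.py is just a placeholder), submitted with the same flink run ... -py invocation:
# check_version.py - minimal sanity check (hypothetical file name); submit it
# with `flink run ... -py check_version.py` to see which interpreter actually
# runs the client-side Python code.
import sys

print(sys.version)
print(sys.executable)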
Tried again with a venv:
sudo flink run -m yarn-cluster -yid <application_id> -pyarch file:///home/hadoop/venv.zip -pyclientexec venv.zip/venv/bin/python3 -py /usr/lib/flink/examples/python/table/word_count.py
At this point, I start seeing various issues, ranging from file-not-found errors to the mysterious
pyflink.util.exceptions.TableException: org.apache.flink.table.api.TableException: Failed to execute sql
I ran various permutations with and without yarn-cluster mode, but have made no progress so far.
I suspect my issues are either environment related (why AWS doesn't take care of the proper Python version is beyond me) or due to my inexperience with YARN/PyFlink.
Any pointers would be greatly appreciated.

This is what you do. To make a cluster:
aws emr create-cluster --release-label emr-6.5.0 --applications Name=Flink --configurations file://./config.json --region us-west-2 --log-uri s3://SOMEBUCKET --instance-type m5.xlarge --instance-count 2 --service-role EMR_DefaultRole --ec2-attributes KeyName=YOURKEYNAME,InstanceProfile=EMR_EC2_DefaultRole --steps Type=CUSTOM_JAR,Jar=command-runner.jar,Name=Flink_Long_Running_Session,Args=flink-yarn-session,-d
Contents of config.json:
[
  {
    "Classification": "flink-conf",
    "Properties": {
      "python.executable": "python3",
      "python.client.executable": "python3"
    },
    "Configurations": []
  }
]
Then, once you are in, try this:
sudo flink run -m yarn-cluster -yid YID -py /usr/lib/flink/examples/python/table/batch/word_count.py
You can find the YID (the YARN application ID) in the AWS EMR console under Application user interfaces, or with yarn application -list on the master node.
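For reference, the same cluster spec can also be created programmatically; the sketch below uses boto3's run_job_flow and mirrors the CLI command above (the region, bucket, and key pair name are placeholders to replace):
# Sketch only: recreate the `aws emr create-cluster` call above with boto3.
# Region, LogUri bucket and Ec2KeyName are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="flink-pyflink-demo",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Flink"}],
    Configurations=[{
        "Classification": "flink-conf",
        "Properties": {
            "python.executable": "python3",
            "python.client.executable": "python3",
        },
    }],
    LogUri="s3://SOMEBUCKET/",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 1},
        ],
        "Ec2KeyName": "YOURKEYNAME",
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Steps=[{
        "Name": "Flink_Long_Running_Session",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": ["flink-yarn-session", "-d"]},
    }],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
)
print("cluster id:", response["JobFlowId"])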

Related

Amazon SageMaker ScriptMode Long Python Wheel Build Times for CUDA Components

I use a PyTorch estimator with SageMaker to train/fine-tune my graph neural net on multi-GPU machines.
The requirements.txt that gets installed into the Estimator container has lines like:
torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-cluster -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
torch-spline-conv -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
When SageMaker installs these requirements in the Estimator on the endpoint, it takes ~2 hrs to build the wheel. It takes only seconds on a local Linux box.
SageMaker Estimator:
PyTorch v1.10
CUDA 11.x
Python 3.8
Instance: ml.p3.16xlarge
I have noticed the same issue with other wheel-based components that require CUDA.
I have also tried building a Docker container on p3.16xlarge and running that on SageMaker, but it was unable to recognize the instance GPUs.
Anything I can do to cut down these build times?
Installing these packages with pip requires [compiling][1], which takes time. I'm not sure, but on your local instance the wheel may already have been built the first time. One workaround is to extend the base [container][2] with the lines below (a one-time cost) and use the resulting image in the SageMaker Estimator:
ADD ./requirements.txt /tmp/packages/
RUN python -m pip install --no-cache-dir -r /tmp/packages/requirements.txt
[1]: https://github.com/rusty1s/pytorch_scatter/blob/master/setup.py
[2]: https://github.com/aws/deep-learning-containers/blob/master/pytorch/training/docker/1.10/py3/cu113/Dockerfile.sagemaker.gpu
The solution is to augment the stock estimator image with the right components; it can then be run in SageMaker script mode:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10-gpu-py38
COPY requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
The key is to make sure the NVIDIA runtime is used at build time, so the Docker daemon.json needs to be configured accordingly:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
This is still not a complete solution, because the viability of the build for SageMaker depends on the host where the build is performed.
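Once an image like that has been built (with the NVIDIA runtime active) and pushed to ECR, the training job can point at it via image_uri; a rough sketch, where the ECR URI, role and entry_point are placeholders rather than values from the question:
# Sketch: run script mode against the pre-built image so the CUDA wheels are
# no longer compiled at training time. URI, role and entry_point are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",          # your existing training script
    source_dir=".",                  # directory containing train.py
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-pyg:1.10-gpu-py38",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_type="ml.p3.16xlarge",
    instance_count=1,
)
estimator.fit({"training": "s3://my-bucket/training-data/"})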

vespa deploy --wait 300 app-1-getting-started gets error-code "METHOD_NOT_ALLOWED"

When I follow this article to practice Vespa:
https://docs.vespa.ai/en/tutorials/news-1-getting-started.html
and I run this step:
vespa deploy --wait 300 app-1-getting-started
I get this error:
{
  "error-code": "METHOD_NOT_ALLOWED",
  "message": "Method 'POST' is not supported"
}
Why, and how can I fix this error?
I am unable to reproduce this; I just ran through the steps. I suggest you submit an issue at https://github.com/vespa-engine/vespa/issues with your environment details, and also include vespa.log, so the Vespa team can have a look.
Vespa deploys to http://localhost:19071/, and if the service running on that port is not the Vespa configuration service but a different HTTP server that returns 405, that might explain the behavior you observe. The tutorial starts the Vespa container image with three port bindings:
8080:8080 is the Vespa container (data plane, read and write)
19071:19071 is the Vespa configuration service, which accepts the app package (control plane)
docker run -m 10G --detach --name vespa --hostname vespa-tutorial \
--publish 8080:8080 --publish 19071:19071 --publish 19092:19092 \
vespaengine/vespa
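One quick way to check what is actually answering on 19071 is to call the health endpoint that the config server exposes; a small sketch assuming the tutorial container runs locally:
# Sketch: confirm that port 19071 is the Vespa configuration service rather
# than some other HTTP server. Assumes the tutorial container runs locally.
import requests

resp = requests.get("http://localhost:19071/state/v1/health", timeout=5)
print(resp.status_code)   # the config service should answer 200 here
print(resp.json())        # and report a status of "up"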

How to "Run a single Flink job on YARN " by rest API?

From the official Flink documentation we know that we can "Run a single Flink job on YARN" with the command below. My question is: can we "Run a single Flink job on YARN" via the REST API, and get the application ID?
./bin/flink run -m yarn-cluster -yn 2 ./examples/batch/WordCount.jar
See the (somewhat deceptively named) Monitoring REST API. You can use the /jars/upload request to send your (fat/uber) jar to the cluster. This returns an id that you can use with the /jars/:jarid/run request to start your job.
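A rough sketch of those two calls from Python (host, port and jar name are placeholders; by default the JobManager REST endpoint listens on 8081):
# Sketch: upload a fat jar and start it through Flink's REST API.
import requests

BASE = "http://jobmanager-host:8081"   # placeholder JobManager REST address

# 1. Upload the fat/uber jar; the response carries the server-side file name.
with open("WordCount.jar", "rb") as jar:
    upload = requests.post(
        BASE + "/jars/upload",
        files={"jarfile": ("WordCount.jar", jar, "application/x-java-archive")},
    )
upload.raise_for_status()
jar_id = upload.json()["filename"].split("/")[-1]

# 2. Run the uploaded jar. Fields such as entryClass or programArgs can be
#    added to the body if the jar's manifest does not already declare them.
run = requests.post(BASE + "/jars/" + jar_id + "/run", json={})
run.raise_for_status()
print("job id:", run.json()["jobid"])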
If you also need to start up the cluster, then (AFAIK) you currently need to write some Java code to start a cluster on YARN. There are two source files in Flink that do this:
ProgramDeployer.java Used by the Flink Table API.
CliFrontEnd.java Used by the command line tool.

FLINK : Deployment took more than 60 seconds

I am new to Flink and trying to deploy my jar on an EMR cluster. I am using a 3-node cluster (1 master and 2 slaves) with the default configuration; I have not made any configuration changes. On running the following command on my master node:
flink run -m yarn-cluster -yn 2 -c Main /home/hadoop/myjar-0.1.jar
I am getting the following error:
INFO org.apache.flink.yarn.YarnClusterDescriptor - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster
Can anyone please explain what could be the possible reason for this error?
Since you didn't specify any resources (memory, CPU cores), I guess the YARN cluster does not have the desired resources available, especially memory.
Try submitting your jar file with a command of the following form:
flink run -m yarn-cluster -yn 5 -yjm 768 -ytm 1400 -ys 2 -yqu streamQ my_program.jar
You can find more information about these options in the Flink CLI documentation.
You can check the application logs in the YARN Web UI (for example with yarn logs -applicationId <application_id>) to see what the problem is exactly.
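To see at a glance whether the cluster really is short on memory or vcores, you can also query the YARN ResourceManager's REST API; a small sketch (resourcemanager-host is a placeholder, and the RM web port is typically 8088):
# Sketch: read cluster-wide resource availability from the ResourceManager.
import requests

metrics = requests.get(
    "http://resourcemanager-host:8088/ws/v1/cluster/metrics", timeout=5
).json()["clusterMetrics"]
print("available memory (MB):", metrics["availableMB"])
print("available vcores:", metrics["availableVirtualCores"])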

DC/OS installation failure during preflight

I am using 5 cloud-based VMs to install DC/OS:
1 Mesos master
3 Mesos agents
1 launching VM
I installed Docker on my launching VM and started installing DC/OS. The install_prereqs stage runs successfully without any errors, but preflight fails with the errors below for each of my VMs.
STDERR:
Connection to 129.114.18.235 closed.
STDOUT:
Running preflight checks /opt/dcos_install_tmp/dcos_install.sh: line 225: getenforce: command not found
Checking if docker is installed and in PATH: FAIL
Checking if unzip is installed and in PATH: FAIL
Checking if ipset is installed and in PATH: FAIL
Checking if systemd-notify is installed and in PATH: FAIL
/opt/dcos_install_tmp/dcos_install.sh: line 387: systemctl: command not found
Checking if systemctl is installed and in PATH: FAIL
Checking Docker is configured with a production storage driver: /opt/dcos_install_tmp/dcos_install.sh: line 285: docker: command not found
Do I need to install all the required software on my master and agent VMs? Please advise.
We have a similar setup, but using plain VMs. We found that Docker needs to be running on all nodes, including masters, before running the install. Also, make sure you look at /etc/sysconfig/docker-storage and have DOCKER_STORAGE_OPTIONS= -s overlay set in that file on all nodes.
I don't believe this is a production setup, but it should get you running. You may also want to check the privileges of the user executing the install on the remote nodes: does it have permission to see/run systemctl?
I had the same error with the DC/OS web installer in version 1.9.
I solved the error after double-checking the bootstrap machine's private key in the web form. To create the key, log into the bootstrap machine and run:
$ ssh-keygen -t rsa
$ for i in `cat dcos-ips.txt`; do ssh-copy-id root@$i; done
$ cat ~/.ssh/id_rsa
