I want to run Python code using Apache Beam on Apache Flink. The command that the Apache Beam site gives for launching Python code on Apache Flink is as follows:
docker run --net=host apachebeam/flink1.9_job_server:latest --flink-master=localhost:8081
The following is a discussion of different methods of executing code using Apache Beam on Apache Flink, but I haven't seen an example of launching it:
https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html
I want to run this without Docker. What would the equivalent command be?
You can spin up the Flink job server directly from the Beam source code. Note that you'll need Java installed.
1) Clone the Beam source code:
git clone https://github.com/apache/beam.git
2) Start the job server
cd beam
./gradlew -p runners/flink/1.8/job-server runShadow -PflinkMasterUrl=localhost:8081
Some helpful tips:
This is not Flink itself! You'll need to spin up Flink separately.
The Flink job service actually spins up a few services:
Expansion Service (port 8097): This service allows you to use ExternalTransforms within your pipeline that exist within the Java SDK. For example, the transforms found within the Python SDK's apache_beam.io.external.* hit this expansion service.
Artifact Service (port 8098): This is where the pipeline uploads your Python artifacts (e.g. pickle files) to be used by the Flink taskmanager when it executes your Python code. From what I recall, you must share the artifact staging area (defaults to /tmp/beam-artifact-staging) between the Flink taskmanager and this artifact service.
Job Service (port 8099): This is what you submit your pipeline to. It translates your pipeline into something Flink can execute and submits it.
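Once the job server is up, you can point a Python pipeline at it through the PortableRunner. Here is a minimal sketch, assuming the defaults above (job service on localhost:8099) and LOOPBACK execution for local testing:
# Minimal sketch: submit a Python pipeline to the Flink job server started above.
# Assumes the job service is reachable at localhost:8099 (the default port listed above).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=LOOPBACK",  # run the SDK harness inside this Python process (handy for local testing)
])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "world"])
     | "Print" >> beam.Map(print))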
Related
My goal is to create a streaming pipeline to read data from Apache Kafka, process the data, and write back to it.
For security reasons, I want to avoid Docker and use Podman.
I have set up a minimal cluster via a docker-compose.yml with a jobmanager, a taskmanager and a Python SDK harness worker. The SDK harness worker seems to get stuck when I try to execute a pipeline.
When I run the pipeline (reading a multi-line .txt file and writing it back to a file), it gets transferred to the jobmanager and taskmanager correctly, but then goes idle. When I look in the Python SDK container, the logs show the following message repeatedly:
2022/12/04 16:13:02 Starting worker pool 1: python -m
apache_beam.runners.worker.worker_pool_main --service_port=50000
--container_executable=/opt/apache/beam/boot
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1',
'--logging_endpoint=localhost:45087',
'--artifact_endpoint=localhost:35323',
'--provision_endpoint=localhost:36435',
'--control_endpoint=localhost:33237']
2022/12/04 16:16:31 Failed to obtain provisioning information: failed to
dial server at localhost:36435
caused by:
context deadline exceeded
Here is a link to a test pipeline that was created:
Example on github
Environment:
Debian 11;
Podman;
Python 3.9.2;
apache-beam==2.38.0; and
podman-compose
The setup of the cluster is defined in docker-compose.yml:
1x flink-jobmanager (flink version 1.14)
1x flink-taskmanager
1x Python Harness SDK
I chose to create an SDK container manually because I don't have Docker installed, and Flink fails when it tries to create a container via Docker.
I suspect that I have made a mistake in the network setup, or that some configuration is missing for the harness worker, but I could not figure out the problem. Any thoughts?
Cross-posted on the user mailing list of beam.apache.org.
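(For context, a worker pool started this way is normally paired with pipeline options along the following lines. This is only a sketch; the service name pythonsdk and the job endpoint are assumptions based on the compose setup and log output above.)
# Sketch of pipeline options for the external worker-pool setup described above.
# "pythonsdk" (the compose service name) and the job endpoint are assumptions, not from the original post.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",          # assumption: a Beam job server reachable from the client
    "--environment_type=EXTERNAL",
    "--environment_config=pythonsdk:50000",   # the worker pool listening on port 50000 in the SDK container
])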
I've just upgraded my Flink from version 1.9.1 to 1.11.2 (using Docker).
I already have many Flink jobs running on version 1.9.1.
When I try to upgrade to 1.11.2 and re-run my job, it shows an error:
2020-11-12 06:49:17,731 WARN org.apache.zookeeper.ClientCnxn []
- SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-1135609831848314731.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2020-11-12 06:49:17,739 INFO org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server xxxxxx:2181
2020-11-12 06:49:17,741 ERROR org.apache.curator.ConnectionState [] - Authentication failed
And this is the error after deploying my Flink job:
Caused by: java.lang.RuntimeException: API paths not defined
and also:
java.lang.NoSuchMethodError: org.apache.flink.api.common.state.OperatorStateStore.getSerializableListState(Ljava/lang/String;)Lorg/apache/flink/api/common/state/ListState;
Do I need to change every pom for my Flink jobs?
Is there any workaround that doesn't require changing my source code?
Thanks
Yes, you do have to rebuild your Flink jobs whenever you update the Flink version being used to run them. The libraries you use should be from the same exact version used by the Job Manager and Task Managers.
If you are trying to automate deployments for a CI/CD pipeline, you could inject the version number into the pom.xml using an environment variable -- but doing things like that can make it hard to debug when things go wrong.
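For illustration, such an injection could look roughly like this in the pom.xml (the property name and environment variable are hypothetical, not from the original answer):
<!-- Hypothetical sketch: declare the Flink version once and let CI/CD inject it. -->
<properties>
  <!-- e.g. export FLINK_VERSION=1.11.2 in the build environment -->
  <flink.version>${env.FLINK_VERSION}</flink.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.12</artifactId>
    <version>${flink.version}</version>
  </dependency>
</dependencies>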
I have an Apache Beam pipeline that I am trying to deploy on a Flink Docker cluster running locally.
The pipeline fails with:
The RemoteEnvironment cannot be instantiated when running in a pre-defined context (such as Command Line Client, Scala Shell, or TestEnvironment)
org.apache.flink.api.java.RemoteEnvironmentConfigUtils.validate(RemoteEnvironmentConfigUtils.java:52)
org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.validateAndGetEffectiveConfiguration(RemoteStreamEnvironment.java:178)
org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.<init>(RemoteStreamEnvironment.java:158)
org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.<init>(RemoteStreamEnvironment.java:144)
org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.<init>(RemoteStreamEnvironment.java:113)
org.apache.beam.runners.flink.FlinkExecutionEnvironments$BeamFlinkRemoteStreamEnvironment.<init>(FlinkExecutionEnvironments.java:319)
org.apache.beam.runners.flink.FlinkExecutionEnvironments.createStreamExecutionEnvironment(FlinkExecutionEnvironments.java:177)
org.apache.beam.runners.flink.FlinkExecutionEnvironments.createStreamExecutionEnvironment(FlinkExecutionEnvironments.java:139)
org.apache.beam.runners.flink.FlinkPipelineExecutionEnvironment.translate(FlinkPipelineExecutionEnvironment.java:98)
org.apache.beam.runners.flink.FlinkRunner.run(FlinkRunner.java:108)
ApacheBeamPocJava.main(ApacheBeamPocJava.java:262)
This is how I am setting up the pipeline
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
options.setRunner(FlinkRunner.class);
options.setFlinkMaster("localhost:6123");
options.setFilesToStage(Arrays.asList("path to the beam jar"));
FlinkRunner flinkRunner = FlinkRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
And after defining the steps of the pipeline, I run it like this:
flinkRunner.run(p);
This is how I submit the job
flink run -c ClassName PATH_TO_JAR
Can someone advise what is going wrong here?
Also, if someone has a Beam <-> Flink example handy for Java, I would definitely appreciate that too.
It seems that you have defined the running environment inside the pipeline itself. Have you tried launching your pipeline as described in the Flink Runner documentation? (Remove the parts of your code where you define or configure a runner.)
As Beam is a framework that decouples your code from the runner that executes it, it's not necessary to have the Flink runner configuration in your pipeline code itself. If you can execute your pipeline locally with the direct runner, it will also work on the Flink runner (or any other supported runner) when compiled with the right profile.
bin/flink run -c org.apache.beam.examples.WordCount /path/to/your.jar --runner=FlinkRunner --other-parameters-for-your-pipeline-or-the-runner
Please be aware that there is currently a bug in Beam 2.25.0 for the Flink runner, so try it with version 2.24.0, or a later version when it's released.
From the official Flink documentation we know that we can "run a single Flink job on YARN" with the command below. My question is: can we "run a single Flink job on YARN" via the REST API, and get back the application ID?
./bin/flink run -m yarn-cluster -yn 2 ./examples/batch/WordCount.jar
See the (somewhat deceptively named) Monitoring REST API. You can use the /jars/upload request to send your (fat/uber) jar to the cluster. This returns an id that you can use with the /jars/:jarid/run request to start your job.
If you also need to start up the cluster, then you're currently (AFAIK) going to need to write some Java code to start a cluster on YARN. There are two source files in Flink that do this same thing:
ProgramDeployer.java: used by the Flink Table API.
CliFrontend.java: used by the command-line tool.
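A rough sketch of those two REST calls, assuming the JobManager's REST endpoint is on localhost:8081 and using placeholder jar path and entry class:
# Rough sketch of submitting a jar via the Flink monitoring REST API.
# Host/port, jar path and entry class are placeholders; adjust for your cluster.
import requests

BASE = "http://localhost:8081"  # JobManager REST endpoint (assumption)

# 1) Upload the fat/uber jar; the multipart form field is named "jarfile".
with open("WordCount.jar", "rb") as f:
    upload = requests.post(f"{BASE}/jars/upload", files={"jarfile": f})
jar_id = upload.json()["filename"].split("/")[-1]  # the id is the last segment of the returned filename

# 2) Run the uploaded jar; entryClass can be omitted if the jar's manifest defines a main class.
run = requests.post(f"{BASE}/jars/{jar_id}/run", json={"entryClass": "org.example.WordCount"})
print(run.json())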
I followed this tutorial
to get a Bigtable client up and running on Google Managed VMs. But is there a way to run this locally? The reason is that deploying the code remotely during development is a pain.
Normally I can use dev_appserver.sh to run a GAE app locally, but when I run it, I get this error:
Caused by: java.lang.IllegalStateException: Jetty ALPN has not been
properly configured.
Does this mean we need to include the ALPN library? Since our codebase is on Java 7, I used this ALPN version: 7.1.3.v20150130.
I then tried again with this:
dev_appserver.sh --jvm_flag=-Xbootclasspath/p:/Users/shouguoli/tmp/alpn-boot-7.1.3.v20150130.jar
But I'm still getting this error:
Caused by: com.google.apphosting.api.ApiProxy$CallNotFoundException:
The API package 'urlfetch' or call 'Fetch()' was not found.
How do you get it to work locally?
The sample was updated last week. It's based on the Java 8 compat runtime, which means that you have access to most of the App Engine APIs, including Users, Task Queues, and Datastore.
There is a new Netty TCNative module that uses BoringSSL.
To use it with the pom.xml in the sample, do:
mvn clean -Pmac jetty:run -Dbigtable.projectID=<your-project> -Dbigtable.clusterID=<your-cluster> -Dbigtable.zone=<your-zone>
To use it on Windows, use -Pwindows instead of -Pmac. For Linux, omit the -P profile flag, as it's the default.
To deploy:
mvn clean gcloud:deploy -Dbigtable.projectID=<your-project> -Dbigtable.clusterID=<your-cluster> -Dbigtable.zone=<your-zone>
NOTE: it is advisable to run clean between running locally and deploying remotely, as the TCNative module is currently specific to the platform the code runs on.
We are in the process of updating all of our samples to use TCNative; we hope to have this done by 3/10/16.