Flink checkpointing failing in Kubernetes with FsStateBackend

I am getting the error stated below while using Flink in Kubernetes with a per-job state backend of FsStateBackend, like so: env.setStateBackend(new FsStateBackend("file:///data/flink/checkpoints"))
I am setting it in my code itself.
Error:
Mkdirs failed to create file:/data/flink/checkpoints/3321ab76ccf319397f5b52be25f6cd8d
Can someone suggest a resolution for this?
Thanks in advance. Cheers!!

In addition to what @chuckskull pointed out, also make sure that this file URI is accessible to every pod in your cluster. All of the task managers and the job manager have to be able to read and write the checkpoint files using this URI.

Here are a couple of things you can check:
Make sure that /data/flink/checkpoints exists.
Make sure that the user running the flink job has read/write access to this directory.
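If you want to rule out a mount or permissions problem from inside the pods, a minimal standalone check like this can help (a hypothetical sketch, not part of Flink; the path matches the one in the question):

import java.io.File;
import java.io.IOException;

public class CheckpointDirCheck {
    public static void main(String[] args) {
        File dir = new File("/data/flink/checkpoints");
        // Mirrors what the state backend does: create the directory if it is missing.
        if (!dir.exists() && !dir.mkdirs()) {
            System.err.println("Mkdirs failed for " + dir + " - check your volume mount and permissions");
            return;
        }
        try {
            // Write and remove a probe file to confirm write access.
            File probe = File.createTempFile("probe", null, dir);
            probe.delete();
            System.out.println("Directory is writable: " + dir);
        } catch (IOException e) {
            System.err.println("No write access to " + dir + ": " + e.getMessage());
        }
    }
}

Running this in each task manager and job manager pod (e.g. via kubectl exec) confirms that every pod sees the same writable path.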

Related

Flink Service temporarily unavailable due to an ongoing leader election. Please refresh

This is the first time I have used Flink. After downloading https://dlcdn.apache.org/flink/flink-1.16.0/flink-1.16.0-bin-scala_2.12.tgz from the website and unpacking it, I ran ./bin/start-cluster.sh to start the cluster. However, when I open the Flink UI at http://localhost:8081/, an error occurs: "Service temporarily unavailable due to an ongoing leader election. Please refresh."
I searched the internet, and there seem to be two possible causes:
I have started multiple Flink clusters and should clean up all the Flink processes, but when I run ps aux | grep flink, I don't find multiple processes, just two.
The problem comes from ZooKeeper, but I don't know how to solve it.
The directory structure is as follows; does anyone know which part I should change?
[screenshot of the directory structure]
Environment: Java 11, macOS (M1)
I'd appreciate it if someone replied.
I want to see the Flink UI, but when I go to http://localhost:8081/ I only get the message "Service temporarily unavailable due to an ongoing leader election. Please refresh".
This is not a Flink-related question; the problem is the hostname. Write HOST="localhost" to ~/.bash_profile and reload it with source ~/.bash_profile on the command line.

Using Flink LocalEnvironment for Production

I wanted to understand the limitations of LocalExecutionEnvironment and whether it can be used to run in production?
Appreciate any help/insight. Thanks
LocalExecutionEnvironment spins up a Flink MiniCluster, which runs the entire Flink system (JobManager, TaskManager) in a single JVM. So you're limited to the CPU cores and memory available on that one machine, and you don't have HA from multiple JobManagers. I haven't looked at other limitations of the MiniCluster environment, but I'm sure more exist.
A LocalExecutionEnvironment doesn't load a config file on startup, so you have to do all of the configuration in the application. By default it also doesn't offer a REST endpoint. You can solve both these issues by doing something like this:
import java.nio.file.Paths;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.GlobalConfiguration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Load flink-conf.yaml from the current working directory, then create a local
// environment with the web UI / REST endpoint enabled.
String cwd = Paths.get(".").toAbsolutePath().normalize().toString();
Configuration conf = GlobalConfiguration.loadConfiguration(cwd);
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);
Logging may be another issue that will require a workaround.
I don't believe you'll be able to use the Flink CLI to control the job, but if you create the web UI (as shown above) you can at least use the REST API to do things like triggering savepoints, after first using the REST API to get the job ID; see the sketch below.
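As an illustration, here is a minimal sketch using Java's built-in HttpClient (Java 11+). The host/port, job ID, and savepoint directory are assumptions for the example, not values from the original question:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SavepointExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Step 1: list jobs; each entry in the JSON response carries a "jid" field.
        HttpRequest overview = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/jobs/overview"))
                .GET()
                .build();
        System.out.println(client.send(overview, HttpResponse.BodyHandlers.ofString()).body());

        // Step 2: trigger a savepoint for a job ID taken from the overview response.
        String jobId = "00000000000000000000000000000000"; // placeholder - use a real jid
        HttpRequest savepoint = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/jobs/" + jobId + "/savepoints"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"target-directory\": \"file:///tmp/savepoints\", \"cancel-job\": false}"))
                .build();
        System.out.println(client.send(savepoint, HttpResponse.BodyHandlers.ofString()).body());
    }
}

The savepoint request returns a trigger ID; polling /jobs/<jobid>/savepoints/<triggerid> tells you when the savepoint has completed.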

change log files output path for flink jobs that run on yarn

We have a few Flink jobs that run on YARN. We would like to upload the Flink job logs to ELK to simplify debugging/analysis. Currently the Flink task managers write their logs to /mnt/flinklogs/$application_id/$container_id. We want them to write to a directory without the nested $application_id/$container_id structure.
I tried env.log.dir: /mnt/flink, but with this setting the configuration is not passed correctly:
-Dlog.file=/mnt/flinklogs/application_1560449379756_1312/\
container_e02_1560449379756_1312_01_000619/taskmanager.log
I think the best approach to solve this is to use YARN log aggregation to write the logs to disk and Elastic Filebeat to ship them to Elasticsearch.
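For reference, log aggregation is switched on in yarn-site.xml. A minimal sketch (yarn.log-aggregation-enable and yarn.nodemanager.remote-app-log-dir are standard YARN properties; the directory value here is illustrative):

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/mnt/flinklogs-aggregated</value>
</property>

Filebeat can then tail the aggregated files and ship them to Elasticsearch.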

Passing custom parameters to docker when running Flink on Mesos/Marathon

My team is trying to set up an Apache Flink (v1.4) cluster on Mesos/Marathon. We are using the Docker image provided by Mesosphere. It works really well!
Because of a new requirement, the task managers have to be launched with extended runtime privileges. We can easily enable these runtime privileges for the app manager via the Marathon web UI. However, we cannot find a way to enable the privileges for the task managers.
In Apache Spark, we can set spark.mesos.executor.docker.parameters privileged=true in Spark's configuration file, so Spark passes this parameter to the docker run command. I am wondering whether Apache Flink allows us to pass a custom parameter to docker run when launching task managers. If not, how can we start task managers with extended runtime privileges?
Thanks
There is a new parameter, mesos.resourcemanager.tasks.container.docker.parameters, introduced in this commit, which allows passing arbitrary parameters to Docker.
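In flink-conf.yaml that would presumably look something like the following; the privileged=true value mirrors the Spark setting from the question, and the exact list format should be checked against the docs for your Flink version:

mesos.resourcemanager.tasks.container.docker.parameters: privileged=true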
Unfortunately, this is not possible as of right now (or only via the framework scheduler, as Tobi pointed out).
I went ahead and created a Jira for this feature so you can keep track/add details/contribute it yourself: https://issues.apache.org/jira/browse/FLINK-8490
You should be able to tweak the parameters in the ContainerInfo of https://github.com/mesoshq/flink-framework/blob/master/index.js to support this. I'll eventually update the Flink version in the Docker image...

Questions regarding Flink streaming with Kafka

I have a Java application that launches a Flink job to process Kafka streams.
The application blocks at the job submission, at flinkEnv.execute("flink job name"), since the job runs forever on the streams coming in from Kafka.
In this case, how can I get the job ID returned from the execution? I see the job ID printed in the console; I just wonder how to get it given that flinkEnv.execute has not returned yet.
And how can I cancel a Flink job, given its job name, from a remote server in Java?
As far as I know, there is currently no nice programmatic way to control Flink. But since Flink is written in Java, everything you can do with the console scripts can also be done with the internal class org.apache.flink.client.CliFrontend, which the console scripts invoke.
An alternative would be using the REST API of the Flink JobManager.
You can use the REST API to monitor the Flink job.
Check this link: https://ci.apache.org/projects/flink/flink-docs-master/monitoring/rest_api.html
You can request http://host:port/jobs/overview to get information about every job, including each job's name and ID. For example:
{
  "jobs": [{
    "jid": "d6e7b76f728d6d3715bd1b95883f8465",
    "name": "Flink Streaming Job",
    "state": "RUNNING",
    "start-time": 1628502261163,
    "end-time": -1,
    "duration": 494208,
    "last-modification": 1628502353963,
    "tasks": {"total": 6, "created": 0, "scheduled": 0, "deploying": 0,
              "running": 6, "finished": 0, "canceling": 0, "canceled": 0,
              "failed": 0, "reconciling": 0, "initializing": 0}
  }]
}
I really hope this will help you.
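To cover the cancellation part of the question: once you have the job ID from /jobs/overview, a PATCH request cancels the job. A minimal sketch with Java's built-in HttpClient (Java 11+; host, port, and job ID are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CancelFlinkJob {
    public static void main(String[] args) throws Exception {
        String jobId = "d6e7b76f728d6d3715bd1b95883f8465"; // taken from /jobs/overview
        // PATCH /jobs/:jobid?mode=cancel asks the JobManager to cancel the job.
        HttpRequest cancel = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/jobs/" + jobId + "?mode=cancel"))
                .method("PATCH", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(cancel, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode()); // 202 means the request was accepted
    }
}

To cancel by job name rather than ID, first fetch /jobs/overview and select the entry whose name field matches.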
