Flink Service temporarily unavailable due to an ongoing leader election. Please refresh - apache-flink

This is the first time I have used Flink. After downloading https://dlcdn.apache.org/flink/flink-1.16.0/flink-1.16.0-bin-scala_2.12.tgz from the website and unpacking it, I ran ./bin/start-cluster.sh to start the cluster. However, when I open the Flink UI at http://localhost:8081/, I get the error "Service temporarily unavailable due to an ongoing leader election. Please refresh."
I searched the internet, and there seem to be two possible reasons:
1. I have started multiple Flink clusters and should clean up all of the Flink processes. However, when I run ps aux | grep flink, I don't see multiple processes, just two.
2. The problem comes from ZooKeeper, but I don't know how to solve it.
The directory structure is as follows; does anyone know which part I should change?
(screenshot of the Flink installation's directory structure)
Environment: Java 11, macOS on an M1 chip.
I'd appreciate it if someone replied.
I want to see the Flink UI, but when I go to http://localhost:8081/ all I get is the message "Service temporarily unavailable due to an ongoing leader election. Please refresh."

This is not really a Flink-related problem; the issue is the hostname. Add HOST="localhost" to ~/.bash_profile and reload it with source ~/.bash_profile on the command line.

Related

Why are Solace messages hanging on the server on restart after Camel 3 upgrade?

On upgrading to Camel 3.3 we came across an issue while testing: if we restart our Camel application while there are multiple messages in the queue, for some reason a few messages get stuck in an "unacknowledged" state on the Solace queue, and after the application comes back up it doesn't consume them. We need to restart the application once more before they get consumed.
The issue only seems to occur when there is a large number of messages in the queue at the time of restart.
We have not been able to recreate the issue on Camel 2.x versions.
The application is correctly set up to use AUTO_ACKNOWLEDGE and works normally in all other circumstances.
"acceptMessagesWhileStopping" is set to false.
We haven't seen any message loss or duplication.
We went through all the setup and configuration that happens at startup and found no issues with either. I am not sure how to go about debugging this, since it is a shutdown-related issue that occurs while messages are still being consumed. Any advice on how to proceed would be helpful. Regards.
P.S. I have gone through the Camel 3 migration guides. I didn't find anything pertinent to the issue there.
Newer versions of Camel 3 don't seem to have this issue.
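For reference, the two acknowledgement-related options mentioned in the question are normally set as camel-jms endpoint URI options. The sketch below only illustrates where they go; the queue name, route id, and logging step are assumptions, not taken from the original application.

```java
import org.apache.camel.builder.RouteBuilder;

// Illustrative route showing the two options discussed above on a
// camel-jms endpoint; the queue name and route body are hypothetical.
public class SolaceConsumerRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("jms:queue:ORDERS"
                + "?acknowledgementModeName=AUTO_ACKNOWLEDGE"
                + "&acceptMessagesWhileStopping=false")
            .routeId("orders-consumer")
            .log("Received message: ${body}");
    }
}
```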

Flink checkpointing failing in Kubernetes with FsStateBackend

I am getting the error stated below while using Flink in Kubernetes with a per-job FsStateBackend, set like this: env.setStateBackend(new FsStateBackend("file:///data/flink/checkpoints"))
I am setting it directly in my code.
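For context, a fuller version of that setup might look like the sketch below; only the setStateBackend line comes from the question, while the checkpoint interval and the placeholder pipeline are assumptions.

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 seconds (interval is an assumption, not from the question)
        env.enableCheckpointing(10_000);

        // State backend from the question: a filesystem path inside the pod
        env.setStateBackend(new FsStateBackend("file:///data/flink/checkpoints"));

        // Placeholder pipeline just so the sketch runs; the real job is not shown in the question
        env.fromElements(1, 2, 3).print();

        env.execute("checkpointed-job");
    }
}
```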
Error:
Mkdirs failed to create file:/data/flink/checkpoints/3321ab76ccf319397f5b52be25f6cd8d
Can someone suggest a resolution for this?
Thanks in advance. Cheers!!
In addition to what #chuckskull pointed out, also make sure that this file URI is accessible to every pod in your cluster. All of the task managers and the job manager have to be able to read and write the checkpoint files using this URI.
Here are a couple of things you can check:
Make sure that /data/flink/checkpoints exists.
Make sure that the user running the Flink job has read/write access to this directory.
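One quick way to verify both of those points from inside a pod is a small check along these lines; this is only a diagnostic sketch, with the path taken from the question.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Diagnostic sketch: confirm the checkpoint directory exists and is
// writable by the user the Flink process runs as inside the container.
public class CheckpointDirCheck {
    public static void main(String[] args) {
        Path dir = Paths.get("/data/flink/checkpoints");
        System.out.println("exists:   " + Files.exists(dir));
        System.out.println("writable: " + Files.isWritable(dir));
    }
}
```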

How to delete failed jobs en masse in Camunda Cockpit?

I have a Camunda flow that has failed about 30k times, because the service it was hitting was shut down. Is there a method for removing all of the failed jobs at once from inside the project cockpit?
I had similar problems a few years ago and created a command-line application for accessing the REST API of the Cockpit. It is available at https://github.com/jhorstmann/camunda-cockpit-client; an example usage would be:
cockpit-client.py -u username -p password --all -e live -m "error message" --cancel
where live refers to a section in a cockpit-client.yaml configuration file, as described in the readme.
I no longer maintain that code, but maybe it solves your problem. If you use the enterprise version of Camunda Cockpit you could probably do the same with its built-in batch operations: https://docs.camunda.org/manual/7.11/webapps/cockpit/batch/batch-operation/.
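If you have access to the engine's Java API (for example from a small maintenance class deployed alongside the application), bulk deletion is also possible without the enterprise Cockpit. The sketch below is an assumption about your setup rather than something from the question; it deletes every job whose retries are exhausted.

```java
import java.util.List;

import org.camunda.bpm.engine.ManagementService;
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.ProcessEngines;
import org.camunda.bpm.engine.runtime.Job;

// Sketch: delete all jobs that have exhausted their retries, using the
// Camunda Java API. Assumes the default ProcessEngine is reachable.
public class DeleteFailedJobs {
    public static void main(String[] args) {
        ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();
        ManagementService managementService = engine.getManagementService();

        List<Job> failedJobs = managementService.createJobQuery()
                .noRetriesLeft()   // jobs whose retries are exhausted
                .list();

        for (Job job : failedJobs) {
            managementService.deleteJob(job.getId());
        }

        System.out.println("Deleted " + failedJobs.size() + " failed jobs");
    }
}
```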
I was unable to find a means to do this via the cockpit but since each instance of Camunda uses a different schema, I was able to drop and rebuild the schema to remove all of the failed jobs.

How to get Flink started in FLIP6 mode

We are using Flink-1.4 on a cluster of 3 machines.
We started the JobManager on one machine with the following command
bin/jobmanager.sh start flip6
Next, we started the TaskManager on two machines with the following command:
bin/taskmanager.sh start flip6
However, the Flink Dashboard web UI does not come up.
Neither do we see any errors in the logs.
Is there something we are missing, maybe in the config file?
Here is the log for the JobManager:
https://gist.github.com/jamesisaactm/72cda2bb286d3a3e20f91e64138941b6
For 1.4 the FLIP-6 mode is still a WIP and missing major parts, like the WebUI.
You will have to wait for 1.5 to use the FLIP-6 mode.

Error restarting Zeppelin interpreters and saving their parameters

I have Zeppelin 0.7.2 installed and connected to Spark 2.1.1 standalone cluster.
It had been running fine for quite a while until I changed the Spark workers' settings to double their cores and executor memory. I also tried changing the SPARK_SUBMIT_OPTIONS and ZEPPELIN_JAVA_OPTS parameters in zeppelin-env.sh to make it request more "Memory per node" on the Spark workers, but it always requested only 1 GB per node, so I removed them.
While developing a paragraph I hit an issue, so I tried setting zeppelin.spark.printREPLOutput to true in the web interface. But when I tried to save that setting, I only got a small transparent red box at the right side of my browser window, so the setting failed to save. I also got that small red box when I tried to restart the Spark interpreter. The same happens when I try to change the parameters of any other interpreter or restart it.
There is nothing in the log files, so I am quite puzzled by this issue. Has anyone experienced this? If so, what solution did you apply to fix it? If not, do you have any suggestions on how to debug it?
I am sorry for the noise. The issue was actually due to my silly mistake.
I have Zeppelin behind nginx. I recently played around with a new CMS and didn't separate the CMS configuration from the proxy to Zeppelin, so any access to a location containing /api/ (such as restarting Zeppelin interpreters or saving interpreter settings) got blocked. Separating the CMS site configuration from the Zeppelin proxy in nginx solved the problem.
