Apache Flink Kubernetes Job Arguments - apache-flink

I'm trying to set up a cluster (Apache Flink 1.6.1) with Kubernetes and get the following error when I run a job on it:
2018-10-09 14:29:43.212 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2018-10-09 14:29:43.214 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered UNIX signal handlers for [TERM, HUP, INT]
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.flink.runtime.entrypoint.ClusterConfiguration.<init>(Ljava/lang/String;Ljava/util/Properties;[Ljava/lang/String;)V
at org.apache.flink.runtime.entrypoint.EntrypointClusterConfiguration.<init>(EntrypointClusterConfiguration.java:37)
at org.apache.flink.container.entrypoint.StandaloneJobClusterConfiguration.<init>(StandaloneJobClusterConfiguration.java:41)
at org.apache.flink.container.entrypoint.StandaloneJobClusterConfigurationParserFactory.createResult(StandaloneJobClusterConfigurationParserFactory.java:78)
at org.apache.flink.container.entrypoint.StandaloneJobClusterConfigurationParserFactory.createResult(StandaloneJobClusterConfigurationParserFactory.java:42)
at org.apache.flink.runtime.entrypoint.parser.CommandLineParser.parse(CommandLineParser.java:55)
at org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.main(StandaloneJobClusterEntryPoint.java:153)
My job takes a configuration file (file.properties) as a parameter. This works fine in standalone mode, but apparently the Kubernetes cluster cannot parse it:
args: ["job-cluster", "--job-classname", "com.test.Abcd", "-Djobmanager.rpc.address=flink-job-cluster",
"-Dparallelism.default=1", "-Dblob.server.port=6124", "-Dquery.server.ports=6125", "file.properties"]
How can I fix this?
Update: The job was built for Apache Flink 1.4.2 and this might be the issue; I'm looking into it.

The job was built against Flink 1.4.2, while the class in the error (EntrypointClusterConfiguration.java) appears to have been added in 1.6.1 (https://github.com/apache/flink/commit/ab9bd87e521d19db7c7d783268a3532d2e876a5d#diff-d1169e00afa40576ea8e4f3c472cf858). This version mismatch caused the issue.
We updated the job's dependencies to point to the new 1.6.1 release, and the arguments are now parsed correctly.
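As a side note on reading the error itself: the NoSuchMethodError names the missing constructor with a JVM method descriptor, (Ljava/lang/String;Ljava/util/Properties;[Ljava/lang/String;)V, which is hard to read at a glance. The following is only a small sketch in Python that decodes such a descriptor into parameter and return types. The function names are made up for illustration and are not part of Flink or the JDK.

```python
# Sketch: decode a JVM method descriptor (as printed in a NoSuchMethodError)
# into readable parameter types and a return type.
# decode_descriptor and _read_type are hypothetical helper names.

PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(desc, pos):
    """Parse one type starting at desc[pos]; return (type_name, next_pos)."""
    dims = 0
    while desc[pos] == "[":            # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":               # object type: L<binary/class/name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:                              # single-letter primitive or void
        name = PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def decode_descriptor(desc):
    """Return (parameter_types, return_type) for a method descriptor."""
    if not desc.startswith("("):
        raise ValueError("not a method descriptor: " + desc)
    pos, params = 1, []
    while desc[pos] != ")":
        name, pos = _read_type(desc, pos)
        params.append(name)
    return_type, _ = _read_type(desc, pos + 1)
    return params, return_type
```

For the constructor in the trace above, this yields the parameter types java.lang.String, java.util.Properties and java.lang.String[] with return type void, i.e. the runtime on the classpath had no ClusterConfiguration constructor with that signature.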


Does upgrading Apache Flink require updating pom.xml?

I've just upgraded my Flink from version 1.9.1 to 1.11.2 (using Docker).
I already have many Flink jobs running on version 1.9.1.
When I try to upgrade to 1.11.1 and rerun my job, it shows an error.
2020-11-12 06:49:17,731 WARN org.apache.zookeeper.ClientCnxn [] - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-1135609831848314731.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2020-11-12 06:49:17,739 INFO org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server xxxxxx:2181
2020-11-12 06:49:17,741 ERROR org.apache.curator.ConnectionState [] - Authentication failed
And this is the error after deploying my Flink job:
Caused by: java.lang.RuntimeException: API paths not defined
and also:
java.lang.NoSuchMethodError: org.apache.flink.api.common.state.OperatorStateStore.getSerializableListState(Ljava/lang/String;)Lorg/apache/flink/api/common/state/ListState;
Do I need to change every pom for my Flink jobs?
Is there any workaround that does not require changing my source code?
Yes, you do have to rebuild your Flink jobs whenever you update the Flink version being used to run them. The libraries you use should be from the exact same version used by the Job Manager and Task Managers.
If you are trying to automate deployments for a CI/CD pipeline, you could inject the version number into the pom.xml using an environment variable, but doing things like that can make it hard to debug when things go wrong.
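One possible way to do the injection, sketched in Python for a CI script: read the version from an environment variable, validate it, and pass it to Maven as a user property. The variable name FLINK_VERSION and the property name flink.version are assumptions here; the pom would have to reference ${flink.version} in its Flink dependencies for this to have any effect.

```python
import os
import re

# Assumed names (not from the question): the FLINK_VERSION environment
# variable and a flink.version property referenced by the job's pom.xml.
VERSION_PATTERN = re.compile(r"^\d+\.\d+\.\d+$")


def maven_command(env=None):
    """Build the Maven invocation for the Flink version named in the environment."""
    env = os.environ if env is None else env
    version = env.get("FLINK_VERSION")
    if version is None:
        raise KeyError("FLINK_VERSION is not set")
    if not VERSION_PATTERN.match(version):
        # Fail early: a wrong value here is exactly the kind of thing that
        # is hard to debug once the job is already deployed.
        raise ValueError("unexpected Flink version: " + repr(version))
    # -Dname=value sets a Maven user property, which takes precedence over a
    # <properties> entry with the same name in the pom.
    return ["mvn", "-Dflink.version=" + version, "clean", "package"]
```

The returned list could then be passed to subprocess.run(..., check=True). Logging the resolved version alongside the build output would help with the debugging concern mentioned above.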

Error while deploying flink application on EMR

I am getting this error when I deploy my flink application on EMR
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/flink/api/common/serialization/DeserializationSchema
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.util.RunJar.run(RunJar.java:232)
However, it works fine when I deploy on a local cluster. I am using Flink 1.9.0 on EMR version 5.28.0.
This issue can have several causes. Things to check are:
Version mismatch between Flink in your dependencies and Flink on EMR.
The core Flink dependencies should have scope `provided`, so that they do not clash with the dependencies already available on the cluster.
What is your JDK version? Is it possible that there is a problem with the environment? I think it is very likely that the JDK version does not match.
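The first two checks above can be roughly automated. The sketch below, in Python, scans a pom.xml for org.apache.flink dependencies and reports version mismatches and core artifacts that are not marked provided. It is only an approximation: which artifacts count as "core" is an assumption, and it looks only at the top-level dependencies section and simple ${property} references, not at parent poms, profiles, or dependencyManagement.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Assumption: artifacts with these prefixes are already shipped by the cluster.
CORE_PREFIXES = ("flink-java", "flink-streaming-java", "flink-clients")


def check_pom(pom_xml, cluster_version):
    """Return a list of human-readable problems found in the pom text."""
    root = ET.fromstring(pom_xml)

    # Collect <properties> so that ${flink.version}-style references resolve.
    props = {}
    props_el = root.find("m:properties", NS)
    if props_el is not None:
        for p in list(props_el):
            props[p.tag.split("}", 1)[-1]] = (p.text or "").strip()

    problems = []
    for dep in root.findall("m:dependencies/m:dependency", NS):
        if dep.findtext("m:groupId", "", NS).strip() != "org.apache.flink":
            continue
        artifact = dep.findtext("m:artifactId", "", NS).strip()
        version = dep.findtext("m:version", "", NS).strip()
        if version.startswith("${") and version.endswith("}"):
            version = props.get(version[2:-1], version)
        # Maven's default scope is "compile" when none is given.
        scope = dep.findtext("m:scope", "compile", NS).strip()

        if version != cluster_version:
            problems.append(
                f"{artifact}: version {version} != cluster {cluster_version}")
        if artifact.startswith(CORE_PREFIXES) and scope != "provided":
            problems.append(
                f"{artifact}: scope is {scope}, expected provided")
    return problems
```

Calling check_pom(open("pom.xml").read(), "1.9.0") returns an empty list if nothing suspicious was found. Running mvn dependency:tree is still the more reliable way to see what actually ends up in the job jar.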

MS WINDOWS: Declaring setenv.bat for Tomcat9 for use with THREDDS server - What is wrong with my syntax?

UPDATE: I've tried starting Tomcat from the command line. During startup I get these messages:
15-Mar-2019 09:05:08.603 INFO [main] org.apache.catalina.startup.HostConfig.deployWAR Deploying web application archive [C:\Program Files\ASF\Tomcat9\webapps\thredds.war]
15-Mar-2019 09:05:15.900 INFO [main] org.apache.jasper.servlet.TldScanner.scanJars At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
15-Mar-2019 09:05:18.286 INFO [main] org.hibernate.validator.internal.util.Version.<clinit> HV000001: Hibernate Validator 4.3.2.Final
15-Mar-2019 09:05:19.382 SEVERE [main] org.apache.catalina.core.StandardContext.startInternal One or more listeners failed to start. Full details will be found in the appropriate container log file
15-Mar-2019 09:05:19.383 SEVERE [main] org.apache.catalina.core.StandardContext.startInternal Context [/thredds] startup failed due to previous errors
15-Mar-2019 09:05:19.460 WARNING [main] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [thredds] appears to have started a thread named [Log4j2-TF-12-Scheduled-2] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
sun.misc.Unsafe.park(Native Method)
15-Mar-2019 09:05:19.469 INFO [main] org.apache.catalina.startup.HostConfig.deployWAR Deployment of web application archive [C:\Program Files\ASF\Tomcat9\webapps\thredds.war] has finished in [10,866] ms
I'm starting to believe that Gerhard is right that it is not a batch issue?
I'm trying to deploy the THREDDS server (version 4.6.13) for Windows using Tomcat 9. However, when I try to start the server I get:
FAIL - Application at context path [/thredds] could not be started
I've tried creating setenv.bat from a working setenv.sh on a Linux machine, but my batch scripting is more than rusty, and I believe the problem comes down to wrong syntax.
My script file is as follows:
set "CATALINA_HOME=%ProgramFiles%/ASF/Tomcat9"
set "CATALINA_BASE=%ProgramFiles%/ASF/Tomcat9"
set "JAVA_HOME=%ProgramFiles%/AdoptOpenJDK/jdk8u202-b08-jre"
:: TDS specific ENVARS
:: Define where the TDS content directory will live
set "CONTENT_ROOT=-Dtds.content.root.path=%ProgramFiles%/ASF/Tomcat9/content"
:: set java prefs related variables (used by the wms service, for example)
set "JAVA_PREFS_ROOTS=-Djava.util.prefs.systemRoot=%CATALINA_HOME%/content/thredds/javaUtilPrefs -Djava.util.prefs.userRoot=%CATALINA_HOME%/content/thredds/javaUtilPrefs"
:: Some commonly used JAVA_OPTS settings:
set NORMAL="-d64 -Xmx4096m -Xms512m -server -ea"
set HEAP_DUMP="-XX:+HeapDumpOnOutOfMemoryError"
set HEADLESS="-Djava.awt.headless=true"
Where did I mess up?

Task Manager not able to connect to Job Manager

I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2.
When I bring up the cluster, the task managers refuse to connect to the job managers with the following error.
2019-03-14 10:34:41,551 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster: Name or service not known]
Now, this works correctly if I add the following line into the /etc/hosts file.
x.x.x.x job-manager-address.com cluster
Why is Flink 1.7.2 connecting to the job manager using the word cluster as the address? Flink 1.4.2 used the job manager's address instead.
The jobmanager.sh script was being invoked with a second argument called cluster.
${Flink_HOME}/bin/jobmanager.sh start cluster
Prior to 1.5, the script expected an execution mode (local or cluster) but this is no longer the case. Invoking the script without the second argument solved this issue.
${Flink_HOME}/bin/jobmanager.sh start

Collecting Metrics with Graphite Plugin leads to "A metric named [..] already exists" error

When I configure flink-conf.yaml to collect metrics with the Graphite plugin, most of the time only incomplete metrics are sent. Multiple errors like the following appear in the TaskManager output:
2018-08-15 00:58:59,016 WARN org.apache.flink.runtime.metrics.MetricRegistryImpl - Error while registering metric.
java.lang.IllegalArgumentException: A metric named mycomputer.taskmanager.8ceab4c3dfbf9fc5fa2af0447f1373a1.State machine job.Source: Custom Source.0.numRecordsOut already exists
at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)
at org.apache.flink.dropwizard.ScheduledDropwizardReporter.notifyOfAddedMetric(ScheduledDropwizardReporter.java:131)
at org.apache.flink.runtime.metrics.MetricRegistryImpl.register(MetricRegistryImpl.java:329)
at org.apache.flink.runtime.metrics.groups.AbstractMetricGroup.addMetric(AbstractMetricGroup.java:379)
at org.apache.flink.runtime.metrics.groups.AbstractMetricGroup.counter(AbstractMetricGroup.java:312)
at org.apache.flink.runtime.metrics.groups.AbstractMetricGroup.counter(AbstractMetricGroup.java:302)
at org.apache.flink.runtime.metrics.groups.OperatorIOMetricGroup.<init>(OperatorIOMetricGroup.java:41)
at org.apache.flink.runtime.metrics.groups.OperatorMetricGroup.<init>(OperatorMetricGroup.java:48)
at org.apache.flink.runtime.metrics.groups.TaskMetricGroup.addOperator(TaskMetricGroup.java:146)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.setup(AbstractStreamOperator.java:174)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.setup(AbstractUdfStreamOperator.java:82)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.<init>(OperatorChain.java:143)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:267)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:748)
I've tried this on a completely fresh flink-1.6.0 release with the following config and the precompiled "State machine job" from the examples folder:
metrics.reporters: grph
metrics.reporter.grph.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.grph.host: localhost
metrics.reporter.grph.port: 2003
metrics.reporter.grph.interval: 1 SECONDS
metrics.reporter.grph.protocol: TCP
I use the official Graphite Docker image (https://hub.docker.com/r/graphiteapp/docker-graphite-statsd/), running with the default configuration.
Does anybody have an idea how I can fix this issue?
Thanks and best regards
To rule out that a specific local setting is responsible for this behaviour, I repeated the process on a clean EC2 instance. Exactly the same error occurs there.
How to reproduce:
Start an EC2 t2.xlarge instance
Install Java
Download Flink from https://www.apache.org/dyn/closer.lua/flink/flink-1.6.0/flink-1.6.0-bin-scala_2.11.tgz
Add flink-metrics-graphite-1.6.0.jar to lib
Configure flink-conf.yaml as mentioned in my previous post
Run ./bin/flink run examples/streaming/StateMachineExample.jar
I have not set up Graphite in this case, because the error obviously already occurs before that point.
After the job has been started you can view the error in the flink dashboard under Task Manager -> Logs
