Unable to run multiple jobs in Apache Flink in parallel - apache-flink

1) I have installed Apache Flink on my local machine (Ubuntu 16.04), developed Java programs, created JAR files, and am trying to run them as jobs through the Flink web front end. I can run each job individually, but I am unable to run multiple jobs in parallel.
Please let me know if any configuration has to be modified so that I can run them simultaneously.
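For context, a standalone Flink cluster can only run as many jobs in parallel as it has free task slots, and the default is one slot per TaskManager, so a second job has nowhere to run. A minimal flink-conf.yaml sketch (the slot count of 4 is illustrative, not a recommendation):

    # flink-conf.yaml
    # Each job (and each parallel subtask) needs a free slot;
    # the default of 1 lets only one job run at a time.
    taskmanager.numberOfTaskSlots: 4

Restart the cluster after changing the file; starting additional TaskManagers also adds slots.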
2) Unable to run a Flink job that has many tasks (>500 tasks in a single job). I get the following exceptions:
(a) org.apache.flink.runtime.io.network.partition.PartitionNotFoundException
(b) org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerDetailsHandler - Implementation error: Unhandled exception
(c) heap memory exceptions
Please let me know how to overcome these and what configuration is needed to run the job.
I have already tried increasing the memory to 2048.
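For context, these symptoms are usually resource-related: PartitionNotFoundException typically means a consumer requested a producer's partition before it was ready and the request backoff expired, and heap errors suggest the TaskManager JVM is too small for 500+ tasks. A hedged flink-conf.yaml sketch (all values illustrative; the memory key shown is the Flink 1.10+ name, older versions use taskmanager.heap.size instead):

    # flink-conf.yaml
    taskmanager.memory.process.size: 4096m           # was ~2048; size to your actual workload
    taskmanager.network.request-backoff.max: 30000   # default 10000 ms; gives slow producers more time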

Related

Flink application TaskManager timeout exception - Flink 1.11.2 running on EMR 6.2.1

We are currently running a Flink application on EMR 6.2.1, which runs on YARN. The Flink version is 1.11.2.
We're running at a parallelism of 180 with 65 task nodes in EMR. When we start up the YARN application we run the following:
flink-yarn-session --name NameOfJob --jobManagerMemory 2048 --taskManagerMemory 32000 --slots 4 --detached
When the Flink job is deployed, it takes about 15 minutes to start up and for all the subtasks to start running. We see several restarts and the following exception:
java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id X timed out
The job then starts up, but after about 20 hours of running the checkpoints start failing and we see the exception above again.
I have seen similar questions suggesting that the job manager memory may need to be increased, but I'm struggling to find guidance on how much to increase it by, or what the recommendations are. The other issue is that by the time we can look at the logs of the task manager that failed, it has already been killed and the logs are gone.
Any guidance on whether increasing the job manager memory will help, and in what increments we should do it, would be appreciated.
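Not a definitive answer, but two knobs that are commonly tried for this symptom, plus a way to recover the lost logs (all values are illustrative):

    # flink-conf.yaml - raise the heartbeat tolerance for long GC pauses
    heartbeat.timeout: 120000    # default is 50000 ms

    # restart the session with more JobManager memory, e.g. doubling it
    flink-yarn-session --name NameOfJob --jobManagerMemory 4096 --taskManagerMemory 32000 --slots 4 --detached

    # fetch logs of an already-killed TaskManager via YARN log aggregation
    yarn logs -applicationId <application_id>

Doubling the JobManager memory and watching its GC activity is a reasonable first increment; 2048 MB is small for a job running at parallelism 180.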

Daemon thread doesn't complete its execution when we restart ZooKeeper

In the current architecture of our project, we are using Solr for gathering, storing, and indexing documents from different sources and making them searchable in near real time.
Our web applications, running on Tomcat, connect to Solr to create and modify the documents.
Solr uses ZooKeeper to keep the configuration centralized.
There are 5 servers in our cluster on which we are running Solr.
When ZooKeeper restarts on one of the servers, the daemon thread created on that server doesn't complete its execution.
As a result, we get continuous logs with the following exception while trying to connect to ZooKeeper from the Tomcat instance:
org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading Illegal access: this web application instance has been stopped already. Could not load [org.apache.zookeeper.ClientCnxn$SendThread]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
which after some time exhausts the threads on the server.
Can someone help me with the question below, please?
Why doesn't the daemon thread complete its execution when we restart ZooKeeper?
Solr version: 8.5.1
ZooKeeper version: 3.5.5
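One common cause of this pattern is that the SolrJ client (which owns the ZooKeeper ClientCnxn$SendThread) is never closed when the webapp is stopped or reloaded, so the thread keeps retrying against the restarted ZooKeeper with a dead classloader. A sketch of closing the client from a ServletContextListener (the class name and ZooKeeper addresses are hypothetical):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Optional;
    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import javax.servlet.annotation.WebListener;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;

    @WebListener
    public class SolrClientLifecycle implements ServletContextListener {
        private CloudSolrClient client;

        @Override
        public void contextInitialized(ServletContextEvent sce) {
            // hypothetical ensemble addresses; replace with your own
            client = new CloudSolrClient.Builder(
                    Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty())
                .build();
            sce.getServletContext().setAttribute("solrClient", client);
        }

        @Override
        public void contextDestroyed(ServletContextEvent sce) {
            try {
                // closes the embedded ZooKeeper client and stops its SendThread
                if (client != null) client.close();
            } catch (IOException e) {
                // container is shutting down; log and continue
            }
        }
    }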

Flink Task Manager Status When Application Crashes

What happens when an exception is thrown from the jar application to the Task Manager while processing an event?
a) Will the Flink Job Manager kill the existing Task Manager and create a new one?
b) Will the Task Manager itself recover from the failed execution and restart processing, using the local state saved in RocksDB?
java.lang.IllegalArgumentException: "Application error-stack trace"
My concern is that if the same kind of erroneous event is processed by each of the available Task Managers, they will all get killed and the entire Flink job will go down.
I am noticing that if an application error occurs, the entire job eventually goes down.
I haven't figured out the exact reason as of now.
In general, an exception in the job should not cause the whole Task Manager to go down. We are talking about "normal" exceptions here. In such a case the job itself will fail, and the Task Manager will try to restart it or not, depending on the provided restart strategy.
Obviously, if for some reason your Task Manager dies, for example due to timeouts or something else, it will not be restarted automatically unless you use a resource manager or orchestration tool like YARN or Kubernetes. In that case the job should be started once slots are available again.
As for the behaviour you have described, where the job itself is "going down", I assume the job is simply going to the FAILED state. This is because different restart strategies have different thresholds for the maximum number of retries, and if the job still does not work after the specified number of restarts, it will simply go to the FAILED state.
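To make the last point concrete, here is a minimal sketch of configuring a fixed-delay restart strategy on the execution environment (the retry count and delay are illustrative):

    import java.util.concurrent.TimeUnit;
    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartStrategyExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // allow at most 3 restarts, 10 seconds apart;
            // one more failure after that and the job goes to FAILED
            env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
            env.fromElements(1, 2, 3).print();
            env.execute("restart-strategy-demo");
        }
    }

The same thing can be set cluster-wide in flink-conf.yaml via restart-strategy: fixed-delay.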

Remote debugging Flink local cluster

I want to deploy my jobs on a local Flink cluster during development (i.e. JobManager and TaskManager running on my development laptop) and use remote debugging. I tried adding
"-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" to the flink-conf.yaml file. Since the job and task manager run on the same machine, the task manager throws an exception stating that the socket is already in use, and terminates. Is there any way I can get this running?
You are probably setting env.java.opts, which affects all JVMs started by Flink. Since the jobmanager gets started first, it grabs the port before the taskmanager is started.
You can use env.java.opts.taskmanager to pass options only to TaskManager JVMs.
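A sketch of what that looks like in flink-conf.yaml (the ports are examples; pick any two free ports):

    # debugger attaches only to TaskManager JVMs
    env.java.opts.taskmanager: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
    # optionally debug the JobManager too, on a different port
    env.java.opts.jobmanager: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5006"

Note that suspend=y makes each JVM wait for a debugger to attach before starting; use suspend=n if you don't want startup to block.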

Tomcat 6.0 stops automatically after a certain time

Tomcat 6.0 is getting stopped automatically after a certain time. My machine is never turned off, but the process still stops. I am using my Tomcat server in production mode, and I really don't feel good about having to start my server daily.
What could be the reason? In production mode the server should never stop.
Check your Task Scheduler:
Go to Start and type "task scheduler" in the search box.
Open Task Scheduler and check whether any scheduled task is stopping the server.
Alternatively, you can increase the PermGen space.
The server might be stopping because of an out-of-memory exception.
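If it does turn out to be memory, a sketch of raising the heap and PermGen sizes via bin/setenv.sh (create the file if it doesn't exist; the sizes are illustrative, and -XX:MaxPermSize only applies to the pre-Java-8 JVMs that Tomcat 6 typically runs on):

    # bin/setenv.sh
    export CATALINA_OPTS="-Xms512m -Xmx1024m -XX:PermSize=128m -XX:MaxPermSize=256m"

Checking catalina.out for an OutOfMemoryError before tuning would confirm whether memory is actually the cause.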
