I am using detached mode and yarn-cluster mode to run my Flink application in job mode, as follows:
flink run -d -m yarn-cluster -yn 10 -ys 1 -yqu QueueA -c com.me.MyFlinkApplicaiton
The application starts up and the job starts consuming messages from Kafka successfully.
After running smoothly for several hours, the Flink YARN application is still alive/running, but the job inside it disappears (there are no jobs or tasks running any more) and all the slots are freed.
My application is a simple Kafka source -> MongoDB sink pipeline, and I have wrapped the whole sink function's invoke method in a try/catch, so no exception should be thrown from the sink function.
I couldn't find any useful logs to investigate this problem, so I would like to ask what might cause this behavior.
OK, it looks like I have found the problem: I had specified the restart strategy in the code as
env.setRestartStrategy(RestartStrategies.noRestart())
With this setting, when the TaskManager exits and the job is cancelled, Flink will not try to restart them.
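For anyone hitting the same issue, here is a minimal sketch of what switching to a fixed-delay restart strategy might look like (the attempt count and delay are just illustrative values):

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // With noRestart(), a single task failure cancels the whole job, while the
        // YARN session keeps running with all of its slots freed:
        // env.setRestartStrategy(RestartStrategies.noRestart());

        // A fixed-delay strategy instead retries the job after a failure:
        // here up to 3 attempts, waiting 10 seconds between attempts.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

        // ... build the Kafka -> MongoDB pipeline here and call env.execute(...)
    }
}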
I'm using the Flink 1.15 Docker images in Session mode, set up pretty much the same as in the Compose documentation, with one Task Manager. A few minutes after starting my streaming job I get a stack dump log message from my Job Manager stating that the Task Manager is no longer reachable, and I see that my Task Manager Docker container has exited with code 137, which possibly indicates an out-of-memory error. However, docker inspect shows the OOMKilled flag as false, indicating some sort of other issue.
End of stack trace from Job Manager log:
Caused by: org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id 172.18.0.5:44333-7c7193 is no longer reachable.
The TaskManager Docker logs produce no error whatsoever before exiting. If I resurrect the dead Task Manager Docker container and have a look at the log file in /opt/flink/logs/ then the last messages state that the various components in my pipeline have switched from INITIALIZING to RUNNING.
I would have expected an out of memory stack dump from the task manager if my state had become too large. Also docker inspect shows that the container did not exit because of an out of memory error.
I have no idea what causes my Task Manager to die. Any ideas how I can figure out what is causing the issue? (This happens on 1.15.1 & 1.15.2. I haven't used any other version of Flink.)
This has happened to me when a Task Manager runs out of memory and when the GC takes too much time trying to free some memory.
I know you said docker inspect doesn't show that it shut down because of memory issues, but still try giving it more RAM, or decreasing the memory requirements of your tasks, and see if it still crashes.
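If it helps to experiment, one way to give the Task Manager a bigger memory budget in a Compose setup is through the FLINK_PROPERTIES environment variable (the service layout and the 2048m below are only example values, not a recommendation):

services:
  taskmanager:
    image: flink:1.15.2
    command: taskmanager
    # FLINK_PROPERTIES entries are appended to flink-conf.yaml by the image's entrypoint.
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.memory.process.size: 2048m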
I ended up using nothing more sophisticated than trial and error with a variety of different test jobs. I'm not 100% sure I fixed the problem, as the issue of the Task Manager crashing without a stack dump occurred sporadically. However, the Task Manager hasn't crashed for several days.
The simplest job to recreate my issue was a SourceFunction outputting a continuous stream of incrementing Longs straight to a DiscardingSink. With this setup the Task Manager would sporadically crash after a while on my Linux machine, but never on my Mac.
If I added a Thread.sleep to the SourceFunction's run loop then the crash would still eventually occur, but take a bit longer.
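A minimal sketch of that kind of test job (the class names here are placeholders rather than my exact code):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class LongStreamJob {

    // Emits an ever-increasing sequence of Longs until the job is cancelled.
    public static class CountingSource implements SourceFunction<Long> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            long value = 0L;
            while (running) {
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(value++);
                }
                // Optional throttle; with it the crash still happened, just later.
                // Thread.sleep(1);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new CountingSource()).addSink(new DiscardingSink<>());
        env.execute("Counting source to discarding sink");
    }
}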
I tried the Source framework instead of SourceFunction, where a SingleThreadMultiplexSourceReaderBase repeatedly calls fetch on a SplitReader to output the Longs. There have been fewer crashes since I did this, but it didn't eliminate them completely.
I presume my SourceFunction was overfilling some sort of buffer or making a task slot unresponsive, as it never relinquished the slot once it started. (Or there may be some other, completely different explanation.)
I wish the Task Manager gave some sort of indication why it stopped running.
According to https://cloud.google.com/trace/docs/setup/php, App Engine flexible environment for PHP can run a daemon that sends trace spans to Stackdriver in the background rather than as part of the request processing (which could cause increased response latency).
I am running Kubernetes Engine, but would still like to send trace requests in the background. Therefore:
1. Is it possible to run that batch daemon myself?
2. Out of curiosity, how does the Stackdriver PHP Exporter pass these spans to the daemon? I tried to search for that in the source code, but could not find out how it is done.
3. If #1 is not possible, is there another way to perform span sending in the background?
The question "Stackdriver Trace with Google Cloud Run" seems to cover a similar topic, but does not address how to run the daemon manually.
In case anyone else is looking for this, I was able to run the batch daemon as follows:
sudo -u www-data -E vendor/bin/google-cloud-batch daemon
Note that the daemon itself must be run as the same user as your “serving” PHP processes in order to access the SysV shared memory used by both, hence the sudo.
You will also need the PHP sysv and pcntl extensions enabled.
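If you are building a container image, a sketch of enabling those extensions might look like this (the base image tag is just an example):

# Dockerfile fragment: enable the SysV IPC and pcntl extensions the daemon relies on.
FROM php:8.1-apache
RUN docker-php-ext-install sysvmsg sysvsem sysvshm pcntl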
I saw this line in the Flink documentation but can't figure out what 'detached mode' means. Please help. Thanks.
Run example program in detached mode:
./bin/flink run -d ./examples/batch/WordCount.jar
The Flink CLI runs jobs either in blocking or detached mode. In blocking mode, the CliFrontend (client) process keeps running, blocked, waiting for the job to complete -- after which it will print out some information. In the example below I ran a streaming job, which I cancelled from the WebUI after a few seconds:
$ flink run target/oscon-1.0-SNAPSHOT.jar
Starting execution of program
Program execution finished
Job with JobID b02da01c30585bfbc86a23446559987f has finished.
Job Runtime: 8673 ms
If you run in blocking mode, you can kill the CliFrontend (e.g., with ctrl-C) if you like, and the job will be unaffected, so long as it has run far enough to have submitted the job to the cluster.
In detached mode, the CliFrontend submits the job to the cluster and then exits straight away.
That means that the application is not attached (or bound) to your shell session, so if you close your terminal the application will keep running (until it finishes its work). For a batch example that might not be a big problem: it will process the given batch of data and end afterwards. As soon as you switch to a streaming approach, the operations take place on an "infinite stream of data" and have no defined end.
Hope that helps.
I run my application on a Flink standalone cluster, but can't find its sysout output in the console or in FLINK_HOME/log.
Does anyone know where I can see my application's debug log? And how can I tell which TaskManagers (TMs) my application runs on?
When running a Flink application in standalone mode on a cluster, everything that is logged to system out or system err goes into the respective local log/ directories.
So for getting the logs, you have to connect (for example using SSH) to the machines running TaskManagers and retrieve the logs from there.
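For example, if you log through SLF4J from inside an operator, the messages end up in the TaskManager's .log file in that same local log/ directory (the class below is only an illustration):

import org.apache.flink.api.common.functions.MapFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingMapper implements MapFunction<String, String> {
    private static final Logger LOG = LoggerFactory.getLogger(LoggingMapper.class);

    @Override
    public String map(String value) {
        // Goes to the TaskManager's .log file via the configured logging backend,
        // whereas System.out.println would end up in the TaskManager's .out file.
        LOG.debug("Processing record: {}", value);
        return value;
    }
}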
As for "how to know which TMs my application runs on":
The JobManager web interface (running on host:8081 by default) shows where the tasks are deployed to.
When the parallelism == number of slots, the tasks usually run on all machines.
At some point my site, running on Apache2 with mod_wsgi, just stops processing requests. The connection to the server is maintained and the client waits for a response, but it is never returned by Apache. The server at this time is at 0% CPU, and nothing is being processed. I think Apache just puts the requests into a queue and never gets them out of there.
When I perform apache2ctl graceful the problem is not resolved; it only goes away after apache2ctl restart.
My site consists of 4 WSGI instances of a Pyramid application and 2 instances of Zope 3. It normally runs fine and does not have any speed problems that I am aware of.
versions:
Ubuntu 10.04
apache2 2.2.14-5ubuntu8.9
libapache2-mod-wsgi 2.8-2ubuntu1
It sounds like you are using embedded mode to run the multiple applications and are using third-party C extensions that have problems in sub-interpreters, resulting in potential deadlock. Otherwise, your code is internally deadlocking or blocking on external services and never returning, causing exhaustion of the available processes/threads.
For a start, you should look at using daemon mode, delegating each web application to a distinct daemon process group, and forcing each to run in the main interpreter (a configuration sketch follows the links below).
See:
http://code.google.com/p/modwsgi/wiki/QuickConfigurationGuide#Delegation_To_Daemon_Process
http://code.google.com/p/modwsgi/wiki/ApplicationIssues#Python_Simplified_GIL_State_API
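A rough sketch of that kind of delegation in the Apache configuration (the names, paths, and process/thread counts are only placeholders):

# Run the Pyramid application in its own daemon process group,
# and force it into the main (first) interpreter.
WSGIDaemonProcess pyramid-site processes=4 threads=15
WSGIScriptAlias /pyramid /var/www/pyramid/app.wsgi

<Location /pyramid>
    WSGIProcessGroup pyramid-site
    WSGIApplicationGroup %{GLOBAL}
</Location>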
Otherwise use debugging tips described in:
http://code.google.com/p/modwsgi/wiki/DebuggingTechniques
for getting stack traces showing what the application is doing.