I saw this line in the Flink documentation but can't figure out what 'detached mode' means. Please help. Thanks.
Run example program in detached mode:
./bin/flink run -d ./examples/batch/WordCount.jar
The Flink CLI runs jobs in either blocking or detached mode. In blocking mode, the CliFrontend (client) process keeps running, blocked, waiting for the job to complete -- after which it will print out some information. In the example below I ran a streaming job, which I cancelled from the WebUI after a few seconds:
$ flink run target/oscon-1.0-SNAPSHOT.jar
Starting execution of program
Program execution finished
Job with JobID b02da01c30585bfbc86a23446559987f has finished.
Job Runtime: 8673 ms
If you run in blocking mode, you can kill the CliFrontend (e.g., with ctrl-C) if you like, and the job will be unaffected, so long as it has run far enough to have submitted the job to the cluster.
In detached mode, the CliFrontend submits the job to the cluster and then exits straight away.
That means that the application is not attached (or bound) to your shell session, so if you close your terminal the application will keep running (until it finishes its work). For a batch job that might not be a big problem: it will process the given batch of data and end afterwards. As soon as you switch to a streaming approach, though, the operators work on an "infinite stream of data" and have no defined end.
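Incidentally, the same attached-versus-detached distinction exists in the programmatic API: execute() blocks until the job finishes, while executeAsync() submits the job and returns immediately. A minimal sketch (assuming Flink 1.10 or later, where executeAsync() and JobClient are available):

import org.apache.flink.core.execution.JobClient;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AttachedVsDetached {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).print();

        // Attached: blocks until the job finishes, like plain `flink run`.
        // env.execute("attached job");

        // Detached-style: submits the job and returns immediately, like `flink run -d`.
        JobClient client = env.executeAsync("detached-style job");
        System.out.println("Submitted job " + client.getJobID());
    }
}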
Hope that helps.
I'm using the Flink 1.15 Docker images in Session mode, pretty much the same as in the Compose documentation. I have one Task Manager. A few minutes after starting my streaming job, I get a stack dump log message from my Job Manager stating that the Task Manager is no longer reachable, and I see that my Task Manager Docker container has exited with code 137, which possibly indicates an out-of-memory error, although docker inspect shows the OOMKilled flag as false, indicating some other issue.
End of stack trace from Job Manager log:
Caused by: org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id 172.18.0.5:44333-7c7193 is no longer reachable.
The TaskManager Docker logs produce no error whatsoever before exiting. If I resurrect the dead Task Manager Docker container and have a look at the log file in /opt/flink/logs/ then the last messages state that the various components in my pipeline have switched from INITIALIZING to RUNNING.
I would have expected an out of memory stack dump from the task manager if my state had become too large. Also docker inspect shows that the container did not exit because of an out of memory error.
I have no idea what causes my Task Manager to die. Any ideas how I can figure out what is causing the issue? (This happens on 1.15.1 & 1.15.2. I haven't used any other version of Flink.)
This problem happened to me when a Task Manager ran out of memory and the GC took too much time trying to free some memory.
I know you said docker inspect doesn't show that it shut down because of memory issues, but still try giving the Task Manager more RAM (e.g., by raising taskmanager.memory.process.size) or decreasing the memory requirements of your tasks, and see if it still crashes.
I ended up using nothing more sophisticated than trial and error with a variety of different test jobs. I'm not 100% sure I fixed the problem, as the Task Manager crashing without a stack dump occurred only sporadically. However, the Task Manager hasn't crashed for several days.
The simplest job that recreated my issue was a SourceFunction outputting a continuous stream of incrementing Longs straight to a DiscardingSink. With this setup the Task Manager would sporadically crash after a while on my Linux machine, but never on my Mac.
If I added a Thread.sleep to the SourceFunction's run loop then the crash would still eventually occur, just after a bit longer.
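For reference, here is a rough reconstruction of that test job; the class and variable names are mine, not the exact code, and it assumes the Flink 1.15 DataStream API (where SourceFunction and DiscardingSink still exist):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class CountingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new SourceFunction<Long>() {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<Long> ctx) throws Exception {
                long next = 0L;
                while (running) {
                    ctx.collect(next++);  // emit incrementing Longs as fast as possible
                    // Thread.sleep(1);   // the variant that made the crash take longer
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        }).addSink(new DiscardingSink<>());
        env.execute("counting job");
    }
}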
I tried the newer Source framework instead of SourceFunction, with a SingleThreadMultiplexSourceReaderBase repeatedly calling fetch on a SplitReader to output the Longs. There have been fewer crashes since I did this, so it helped, but it didn't fix things 100%.
I presume my SourceFunction was overfilling some sort of buffer or making a task slot unresponsive, as it never relinquished a slot once it started. (Or there is some completely different explanation.)
I wish the Task Manager gave some sort of indication why it stopped running.
Here is my question:
Will Flink finish the sink process and rename the .inprogress files to part-x-x files when a stop command is sent?
I find that my Flink tasks (using flink-1.9.1) do not rename the .inprogress files to part-x-x files. But when I read the source code, it says:
org.apache.flink.client.program.ClusterClient#stopWithSavepoint:
* Stops a program on Flink cluster whose job-manager is configured in this client's configuration.
* Stopping works only for streaming programs. Be aware, that the program might continue to run for
* a while after sending the stop command, because after sources stopped to emit data all operators
* need to finish processing.
The StreamingFileSink does have some limitations in this regard. See this thread from the user@flink.apache.org mailing list.
FLIP-46, which is being tracked as FLINK-13103, is needed in order to fix this. Until then, the StreamingFileSink will remain unable to transition unfinished files to the finished state when a job is stopped. This is described in the documentation as Important Note 2.
I am using detached and yarn-cluster mode to run the Flink application in job mode, as follows:
flink run -d -m yarn-cluster -yn 10 -ys 1 -yqu QueueA -c com.me.MyFlinkApplication
The application starts up and the job in this application starts to consume messages from Kafka successfully.
After running smoothly for several hours, the Flink YARN application is still alive/running, but the job inside it disappears (there is no job/task running any more) and all the slots are freed.
My application is a simple read-from-Kafka-source -> sink-to-MongoDB application, and I have wrapped the whole sink function's invoke method in a try/catch, so no exceptions can be thrown from the sink function.
I didn't find any useful logs to investigate this problem, so I would like to ask what may cause this behavior.
OK, it looks like I have found the problem. I had specified the restart strategy in the code as
env.setRestartStrategy(RestartStrategies.noRestart())
When the TaskManager exits and the job fails, Flink will not try to restart the job, which is why the YARN application kept running with all the slots freed.
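If you do want failed jobs to be retried, a fixed-delay strategy is one alternative. A minimal sketch (the attempt count and delay here are arbitrary examples):

import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;

// Retry a failed job up to 3 times, waiting 10 seconds between attempts.
env.setRestartStrategy(
        RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));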
Spark Streaming provides an API for termination, awaitTermination(). Is there any similar API available to gracefully shut down a Flink streaming job after some t seconds?
Your driver program (i.e. the main method) in Flink doesn't stay running while the streaming job executes. Your program should define a dataflow, call execute, and then terminate. In Spark, the driver program stays running (AFAIK), and awaitTermination relates to that.
Note that a Flink streaming dataflow continues to execute indefinitely, unless you're using a 'bounded' data source with a finite number of elements. You may also cancel or stop a job, and even take a checkpoint upon stopping to be resumed from later.
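If you specifically want the effect of awaitTermination with a timeout, one option is to submit the job asynchronously and cancel it from the client after t seconds. A sketch (assuming Flink 1.12+, where executeAsync() and fromSequence are available; the 10-second limit is an arbitrary example):

import java.util.concurrent.TimeUnit;
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimedShutdown {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromSequence(0, Long.MAX_VALUE).print();  // an effectively unbounded stream

        JobClient client = env.executeAsync("timed job");
        TimeUnit.SECONDS.sleep(10);  // let the job run for t = 10 seconds
        client.cancel().get();       // then cancel it from the client
    }
}

Note that cancel() is a hard stop; if you need a graceful shutdown, JobClient also offers a stopWithSavepoint variant (again assuming a recent Flink version).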
I have a system running embedded Linux, and it is critical that it runs continuously. Basically it is a process for communicating with sensors and relaying that data to a database and web client.
If a crash occurs, how do I restart the application automatically?
Also, there are several threads doing polling (e.g., sockets & UART communications). How do I ensure none of the threads gets hung up or exits unexpectedly? Is there an easy-to-use watchdog that is threading-friendly?
You can seamlessly restart your process when it dies with fork and waitpid, as described in this answer. It does not cost any significant resources, since the OS will share the memory pages.
Which leaves only the problem of detecting a hung process. You can use any of the solutions pointed out by Michael Aaron Safyan for this, but an even easier solution would be to use the alarm syscall repeatedly, having the signal terminate the process (use sigaction accordingly). As long as you keep calling alarm (i.e., as long as your program is making progress) it will keep running; once you stop, the signal will fire.
That way, no extra programs needed, and only portable POSIX stuff used.
The gist of it is:
You need to detect if the program is still running and not hung.
You need to (re)start the program if the program is not running or is hung.
There are a number of different ways to do #1, but two that come to mind are:
Listening on a UNIX domain socket, to handle status requests. An external application can then inquire as to whether the application is still ok. If it gets no response within some timeout period, then it can be assumed that the application being queried has deadlocked or is dead.
Periodically touching a file with a preselected path. An external application can look at the timestamp of the file, and if it is stale, then it can assume that the application is dead or deadlocked (see the sketch just below).
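To illustrate that second option: the idea is language-agnostic, but here is a minimal sketch in Java (the file path and call site are assumptions; the external watchdog just compares the file's mtime against the current time):

import java.nio.file.Files;
import java.nio.file.Path;

public class Heartbeat {
    private final Path file;

    public Heartbeat(Path file) {
        this.file = file;
    }

    // Call this from the application's main loop. If the loop hangs, the
    // file's mtime goes stale and the external watchdog can restart the app.
    public void beat() {
        try {
            Files.write(file, new byte[0]);  // create/truncate the file: updates its mtime
        } catch (Exception e) {
            // ignore: a missed beat simply looks like a stall to the watchdog
        }
    }
}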
With respect to #2, killing the previous PID and using fork+exec to launch a new process is typical. You might also consider turning your application that runs "continuously" into an application that runs once, and then using "cron" or some other tool to continuously rerun that single-run application.
Unfortunately, watchdog timers and getting out of deadlock are non-trivial issues. I don't know of any generic way to do it, and the few that I've seen are pretty ugly and not 100% bug-free. However, ThreadSanitizer (TSan) can help detect potential deadlock scenarios and other threading issues at runtime.
You could create a cron job that uses start-stop-daemon to check from time to time whether the process is running.
Use this script to run your application:
#!/bin/bash
# Restart the program every time it exits with a non-zero status;
# the loop ends once it exits successfully.
while ! /path/to/program
do
    echo "restarting"
done
You can also put this script in /etc/init.d/ in order to start it as a daemon.