Flink streaming: how to control the execution time - apache-flink

Spark Streaming provides the awaitTermination() API for termination. Is there any similar API available to gracefully shut down a Flink streaming job after some t seconds?

Your driver program (i.e. the main method) in Flink doesn't stay running while the streaming job executes. Your program should define a dataflow, call execute, and then terminate. In Spark, the driver program stays running (AFAIK), and awaitTermination relates to that.
Note that a Flink streaming dataflow continues to execute indefinitely, unless you're using a 'bounded' data source with a finite number of elements. You may also cancel or stop a job, and even take a savepoint upon stopping to be resumed from later.

Related

Multithreaded reads and writes to a single SQLite database using the C API

My application consists of a number of threads (typically 5-10) and each is responsible for reading a value from an SQLite database, working on it for an amount of time and then writing a new value back to the database.
Each of the threads is running by itself without any synchronization.
My question is: Do I need to write the synchronization code myself, or is it possible to interact with the SQLite C API in such a way that this gets taken care of for you? I.e., if there is a transaction in progress to write a value and a different thread tries to write or read the same row, will SQLite block until it's okay to do so?
Do I need to write the synchronization code myself, or is it possible to interact with the SQLite C API in such a way that this gets taken care of for you?
The SQLite documentation covers this. In a nutshell, SQLite has three different threading models available:
single-thread, in which all internal mutexes are disabled, and the SQLite API cannot safely be used by multiple threads at the same time;
multi-thread, in which the library can safely be used concurrently by multiple threads, but database connections can be used only by a single thread at a time; and
serialized, in which multiple threads can use the API concurrently without restriction.
There is a built-in default selected at compile time (which is "serialized" mode in a standard build). A different mode can be selected during library initialization (sqlite3_config()), and a per-connection thread mode can be specified when you open a new connection, except that single-thread mode cannot be overridden.
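For illustration, here is how those knobs look in the C API. This is a minimal sketch with a made-up filename; real code should check the return value of every call:

#include <sqlite3.h>
#include <stdio.h>

int main(void) {
    /* Library-wide default: "multi-thread" mode. This must run before
       sqlite3_initialize() or any other use of the library. */
    sqlite3_config(SQLITE_CONFIG_MULTITHREAD);
    sqlite3_initialize();

    /* Per-connection override back to "serialized" via SQLITE_OPEN_FULLMUTEX
       (SQLITE_OPEN_NOMUTEX would select "multi-thread" for this connection).
       Single-thread mode cannot be overridden this way. */
    sqlite3 *db = NULL;
    if (sqlite3_open_v2("example.db", &db,
                        SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE |
                        SQLITE_OPEN_FULLMUTEX, NULL) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    sqlite3_close(db);
    return 0;
}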
But note well that under all circumstances, SQLite provides a single transaction at a time per connection. Thus, you may need your own synchronization even in "serialized" mode.
If there is a transaction in progress to write a value and a different thread tries to write or read the same row, SQLite will block until it's okay to do so?
If an SQLite connection is in serialized mode and has autocommit enabled, then you're fine, in the sense that each statement will be executed in its own transaction, and different threads will not interfere with each other (but they may counteract each other). If a connection is instead in the weaker "multi-thread" mode then you must provide your own synchronization, so that different threads do not attempt to use the connection concurrently. If autocommit is disabled, then you will probably need to synchronize even in "serialized" mode to accommodate multiple threads using the same connection, else you will be unable to effectively control transaction boundaries and the contents of each transaction.
With respect to a single, established connection, there is no meaningful difference between "multi-thread" mode and "single-thread" mode.
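To make that concrete: if several threads share one connection with autocommit disabled, the usual fix is a plain mutex held for the whole transaction. A minimal sketch, with hypothetical names and placeholder statements:

#include <pthread.h>
#include <sqlite3.h>

static sqlite3 *db;  /* one connection shared by all threads */
static pthread_mutex_t txn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Runs a read-modify-write as one transaction. Without the mutex, another
   thread could slip its own statements inside our BEGIN...COMMIT window. */
int crunch_one_value(void) {
    pthread_mutex_lock(&txn_lock);
    sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
    /* ... SELECT the value, compute, UPDATE it ... */
    int rc = sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
    pthread_mutex_unlock(&txn_lock);
    return rc;
}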
If there is a transaction in progress to write a value and a different thread tries to write or read the same row, SQLite will block until it's okay to do so?
The easy way to do this is to use multi-thread mode, with one connection object per thread. Threads can then acquire a lock on the database with a BEGIN IMMEDIATE transaction. If a thread gets a SQLITE_BUSY error, it can either do something else for a while before trying again, or, if you set up a busy timeout ahead of time (of a length reasonable for your needs), it will simply sleep for up to that length of time, periodically retrying to acquire the transaction lock before giving up.
If you use serialized mode and a single connection for the entire program, you have to write all the logic and locking yourself to make sure that only one particular thread can access the database while a transaction is active. It is much easier to use SQLite's native, well-tested support for that functionality.
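A sketch of that pattern, assuming each thread has opened its own connection (the function name is illustrative):

#include <sqlite3.h>

/* Called with this thread's own connection object. */
int crunch_with_own_connection(sqlite3 *db) {
    /* Retry for up to 5 seconds while another connection holds the lock. */
    sqlite3_busy_timeout(db, 5000);

    /* Take the write lock up front so the read-modify-write is atomic. */
    int rc = sqlite3_exec(db, "BEGIN IMMEDIATE", NULL, NULL, NULL);
    if (rc == SQLITE_BUSY)
        return rc;  /* timed out; do something else and try again later */
    if (rc != SQLITE_OK)
        return rc;

    /* ... SELECT the value, crunch it, UPDATE the row ... */

    return sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
}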

Will Flink finish the sink process when sent a stop command

Here is my question:
Will Flink finish the sink process and rename the .inprogress files to part-x-x files when it is sent a stop command?
I find that my Flink tasks (using flink-1.9.1) do not rename the .inprogress files to part-x-x files. But when I read the source code, it says
org.apache.flink.client.program.ClusterClient#stopWithSavepoint:
* Stops a program on Flink cluster whose job-manager is configured in this client's configuration.
* Stopping works only for streaming programs. Be aware, that the program might continue to run for
* a while after sending the stop command, because after sources stopped to emit data all operators
* need to finish processing.
The StreamingFileSink does have some limitations in this regard. See this thread from the user@flink.apache.org mailing list.
FLIP-46, which is being tracked as FLINK-13103, is needed in order to fix this. Until then, the StreamingFileSink will remain unable to transition unfinished files to the finished state when a job is stopped. This is described in the documentation as Important Note 2.

What is Apache Flink's detached mode?

I saw this line in the Flink documentation but can't figure out what 'detached mode' means. Please help. Thanks.
Run example program in detached mode:
./bin/flink run -d ./examples/batch/WordCount.jar
The Flink CLI runs jobs either in blocking or detached mode. In blocking mode, the CliFrontend (client) process keeps running, blocked, waiting for the job to complete -- after which it will print out some information. In the example below I ran a streaming job, which I cancelled from the WebUI after a few seconds:
$ flink run target/oscon-1.0-SNAPSHOT.jar
Starting execution of program
Program execution finished
Job with JobID b02da01c30585bfbc86a23446559987f has finished.
Job Runtime: 8673 ms
If you run in blocking mode, you can kill the CliFrontend (e.g., with ctrl-C) if you like, and the job will be unaffected, so long as it has run far enough to have submitted the job to the cluster.
In detached mode, the CliFrontend submits the job to the cluster and then exits straight away.
That means the application is not attached (or bound) to your shell session, so if you close your terminal the application will keep running (until it finishes its work). For a batch example that might not be a big problem: it will process the given batch of data and end afterwards. But as soon as you switch to a streaming approach, the operations take place on an "infinite stream of data" and have no defined end.
Hope that helps.

Polling a database versus triggering program from database?

I have a process wherein a program running in an application server must access a table in an Oracle database server whenever at least one row exists in this table. Each row of data relates to a client requesting some number crunching performed by the program. The program can only perform this number crunching serially (that is, for one client at a time rather than multiple clients in parallel).
Thus, the program needs to be informed of when data is available in the database for it to process. I could either
have the program poll the database, or
have the database trigger the program.
QUESTION 1: Is there any conventional wisdom why one approach might be better than the other?
QUESTION 2: I wonder if programs have any issues "running" for months at a time (would any processes in the server stop or disrupt the program from running? If so, I don't know how I'd learn there was a problem, other than from angry customers). Does anyone have experience running programs on a server for a long time without issues? Or, if the server does crash, is there a way to auto-start a program (i.e. a C language executable) after the server re-boots, so that no human is needed to start it manually?
Any advice appreciated.
UPDATE 1: Client is waiting for results, but a couple seconds additional delay (from polling) isn't a deal breaker.
I would like to give a more generic answer...
There is no right answer that applies every time. Sometimes you need a trigger, and sometimes it is better to poll.
But… 9 times out of 10, polling is much more efficient, safer, and faster than triggering.
It's really simple. A trigger needs to instantiate a program, of whatever nature, for every firing. That is just not efficient most of the time. Some people will argue that triggers are required when response time is a factor, but even then, half the time polling is better because:
1) Resources: With triggers and, say, 100 messages, you will need resources for 100 threads; with 1 thread processing a packet of 100 messages, you need resources for 1 program.
2) Monitoring: A thread processing packets can constantly report the time consumed per packet of a defined size, clearly indicating how it is performing and when and how performance is being affected. Try that with a billion triggers jumping around…
3) Speed: Instantiating threads and allocating their resources is very expensive, and don't get me started if you are opening a transaction for each trigger. A simple program processing, say, a 100-message packet will always be much faster than initiating 100 triggers…
4) Reaction time: With polling you cannot react to events the instant they happen, so the only case where a trigger may be warranted is when a user is waiting for the message to be processed. But even then you need to be very careful, because if you have lots of clients doing the same thing at the same time, triggering might respond LATER than fast polling would.
My 2 cents. This has been learned the hard way…
1) Have the program poll the database, since you don't want your database to be able to start host programs (you'd have to make sure that only "your" program can be started this way).
The classic (and most convenient, IMO) way of doing this in Oracle would be through the DBMS_ALERT package.
The first program signals an alert with a certain name, passing an optional message. A second program which has registered for the alert waits and receives it immediately after the first program commits. A rollback by the first program cancels the alert.
Of course you can have many sessions signaling and waiting for alerts. However, an alert is a serialization device, so if one program has signaled an alert, other programs signaling the same alert name will be blocked until the first one commits or rolls back.
The table DBMS_ALERT_INFO contains all the sessions which have registered for an alert. You can use this to check whether the alert processing is alive.
2) Autostarting or background execution depends on your host platform and OS. On Windows you can use SRVANY.EXE to run any executable as a service.
I recommend using a C program to poll the database and a utility such as monit to restart the C program if there are any problems. Your C program can touch a file once in a while to indicate that it is still functioning properly, and monit can monitor the file. Monit can also check the process directly and make sure it isn't using too much memory.
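For illustration, the heartbeat side of that arrangement might look like this in C; the path and interval are made-up values that monit's "check file" test would be pointed at:

#include <stdio.h>
#include <unistd.h>
#include <utime.h>

#define HEARTBEAT "/var/run/cruncher.heartbeat"  /* hypothetical path */

int main(void) {
    /* Create the file once so utime() has something to touch. */
    FILE *f = fopen(HEARTBEAT, "w");
    if (f) fclose(f);

    for (;;) {
        /* ... poll the database and process one batch of work ... */

        /* Touch the heartbeat so monit sees a fresh timestamp. */
        utime(HEARTBEAT, NULL);  /* NULL means "set mtime to now" */
        sleep(10);
    }
}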
For more information you could see my answer to this other question:
When a new row in database is added, an external command line program must be invoked
Alternatively, if people aren't sitting around waiting for the computation to finish, you could use a cron job to run the C program on a regular basis (e.g. every minute). Then monit would be less necessary, because your C program would start and stop all the time anyway.
You might want to look into Oracle's "Change Notification":
http://docs.oracle.com/cd/E11882_01/appdev.112/e25518/adfns_cqn.htm
I don't know how well this integrates with a "regular" C program though.
It's also available through .Net and Java/JDBC
http://docs.oracle.com/cd/E11882_01/win.112/e23174/featChange.htm
http://docs.oracle.com/cd/E11882_01/java.112/e16548/dbchgnf.htm
There are simple job managers, like Gearman, that you can use to send a job message from the database to a worker. Gearman has, among others, a MySQL user-defined function interface, so it is probably easy to build one for Oracle as well.

Linux automatically restarting application on crash - Daemons

I have a system running embedded Linux, and it is critical that it runs continuously. Basically it is a process for communicating with sensors and relaying that data to a database and web client.
If a crash occurs, how do I restart the application automatically?
Also, there are several threads doing polling (e.g. sockets & UART communications). How do I ensure none of the threads get hung up or exit unexpectedly? Is there an easy-to-use watchdog that is thread-friendly?
You can seamlessly restart your process as it dies with fork and waitpid as described in this answer. It does not cost any significant resources, since the OS will share the memory pages.
Which leaves only the problem of detecting a hung process. You can use any of the solutions pointed out by Michael Aaron Safyan for this, but an even easier solution would be to call the alarm syscall repeatedly, having the signal terminate the process (set up sigaction accordingly). As long as you keep calling alarm (i.e. as long as your program is making progress), it will keep running. Once you don't, the signal will fire.
That way, no extra programs needed, and only portable POSIX stuff used.
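A minimal sketch of that combination, fork/waitpid for restarting plus alarm as the hang detector; the 30-second budget and one-second backoff are arbitrary choices:

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_alarm(int sig) {
    (void)sig;
    _exit(1);  /* watchdog fired: the work loop stopped re-arming us */
}

static void run_child(void) {
    struct sigaction sa = { 0 };
    sa.sa_handler = on_alarm;
    sigaction(SIGALRM, &sa, NULL);

    for (;;) {
        alarm(30);  /* die unless re-armed within 30 seconds */
        /* ... one iteration of real work; if it hangs, SIGALRM fires
           and the parent restarts us ... */
    }
}

int main(void) {
    for (;;) {  /* supervisor: restart the child whenever it dies */
        pid_t pid = fork();
        if (pid == 0) {
            run_child();  /* never returns */
        }
        if (pid < 0) {
            sleep(1);     /* fork failed; back off and retry */
            continue;
        }
        int status;
        waitpid(pid, &status, 0);  /* blocks until the child exits */
        sleep(1);                  /* brief backoff before restarting */
    }
}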
The gist of it is:
You need to detect if the program is still running and not hung.
You need to (re)start the program if the program is not running or is hung.
There are a number of different ways to do #1, but two that come to mind are:
Listening on a UNIX domain socket, to handle status requests. An external application can then inquire as to whether the application is still ok. If it gets no response within some timeout period, then it can be assumed that the application being queried has deadlocked or is dead.
Periodically touching a file with a preselected path. An external application can look at the timestamp of the file, and if it is stale, it can assume that the application is dead or deadlocked (a sketch of this check appears below).
With respect to #2, killing the previous PID and using fork+exec to launch a new process is typical. You might also consider turning your continuously-running application into one that runs once, and then using cron or some other scheduler to continuously rerun that single-run application.
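As for the stale-timestamp check from the second option above, the external application's side might look like this; the path and threshold are placeholders:

#include <sys/stat.h>
#include <time.h>

#define HEARTBEAT_FILE "/var/run/myapp.heartbeat"  /* hypothetical path */
#define STALE_SECS 60  /* how stale the file may get before we call it hung */

/* Returns nonzero if the watched process looks dead or deadlocked. */
int looks_dead(void) {
    struct stat st;
    if (stat(HEARTBEAT_FILE, &st) != 0)
        return 1;  /* file missing: the process never checked in */
    return (time(NULL) - st.st_mtime) > STALE_SECS;
}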
Unfortunately, watchdog timers and getting out of deadlock are non-trivial issues. I don't know of any generic way to do it, and the few approaches that I've seen are pretty ugly and not 100% bug-free. However, tools such as tsan can help detect potential deadlock scenarios and other threading issues.
You could create a cron job that uses start-stop-daemon to check from time to time whether the process is running.
Use this script to run your application:
#!/bin/bash
while ! /path/to/program  # loop as long as the program exits with a failure status
do
    echo "restarting"     # it crashed or exited unsuccessfully; run it again
done
You can also put this script in /etc/init.d/ in order to start it as a daemon.
