How to know the status of a flink job from Java? - apache-flink

I have a job running, and I want to use only one restart attempt in the recovery strategy. While the Flink restart has not yet been triggered, a thread of mine tries to fix the underlying problem, and once the problem is solved Flink restarts the job. Sometimes, however, the thread takes longer than usual, the restart strategy fires while the issue is still present, and the job fails and is stopped. The thread may still run another iteration, so the application never dies (I run it as a jar application). So, my question:
Is there any way to know the status of the job from Java code? Something like (JobStatus.CANCELED == true).
Thanks in advance!
Kind regards

Thanks a lot Felipe. This is what I needed, and thanks to you it is done. I share the code here in case someone else needs it.
Prepare the listener
final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(...);
final AtomicReference<JobID> jobIdReference = new AtomicReference<>();
// Environment configurations
env.registerJobListener(new JobListener() {
    @Override
    public void onJobSubmitted(@Nullable JobClient jobClient, @Nullable Throwable throwable) {
        assert jobClient != null;
        jobIdReference.set(jobClient.getJobID());
        // keep the client in a public static field of the main class so other threads can reach it
        MainClass.jobClient = jobClient; // "MainClass" stands in for your own main class
    }

    @Override
    public void onJobExecuted(@Nullable JobExecutionResult jobExecutionResult, @Nullable Throwable throwable) {
        assert jobExecutionResult != null;
        synchronized (jobExecutionResult) {
            jobExecutionResult.notify(); // wake up any thread waiting on the result
        }
    }
});
Use the code:
Preconditions.checkNotNull(jobClient);
final String status = jobClient.getJobStatus().get().name();
if (status.equals(JobStatus.FAILED.name())) System.exit(1);
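As a follow-up, the same client can drive a small watchdog thread while the recovery thread does its work. This is only a sketch, assuming MainClass.jobClient is the static reference captured in onJobSubmitted above (the field and thread names are illustrative):
// Illustrative watchdog: polls the job status captured by the JobListener above.
Thread watchdog = new Thread(() -> {
    try {
        while (true) {
            JobStatus status = MainClass.jobClient.getJobStatus().get();
            if (status == JobStatus.FAILED || status == JobStatus.CANCELED) {
                System.exit(1); // stop the whole jar application once the job is gone for good
            }
            Thread.sleep(5_000);
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // watchdog asked to stop
    } catch (ExecutionException e) {
        // status lookup failed; decide whether to retry or exit
    }
});
watchdog.setDaemon(true);
watchdog.start();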

Related

How to get the Ingestion time of an event when the time characteristic is event-time?

I want to measure the time between when an event reaches the system and when it finishes processing, and I think getting the ingestion time will help, but how do I get it?
You probably want to use latency tracking. Alternatively, you can add the processing time directly after the source in a chained process function (with Context->TimerService#currentProcessingTime()).
Based on the reply from David, to get the ingestion time we can chain a process function to the source.
The code below shows how to obtain the ingestion time. Since I also needed a metric for the difference between ingestion time and event time, I used a histogram in the metric group for that.
The following snippet might help you understand it better.
DataStream<EventDataMapping> text = env
    .fromSource(source, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)), "Kafka Source")
    .process(new ProcessFunction<EventDataMapping, EventDataMapping>() {

        private transient DescriptiveStatisticsHistogram eventVsIngestionTimeLag;
        private static final int EVENT_TIME_LAG_WINDOW_SIZE = 10_000;

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            eventVsIngestionTimeLag = getRuntimeContext().getMetricGroup().histogram("eventVsIngestionTimeLag",
                    new DescriptiveStatisticsHistogram(EVENT_TIME_LAG_WINDOW_SIZE));
        }

        @Override
        public void processElement(EventDataMapping eventDataMapping, Context context, Collector<EventDataMapping> collector) throws Exception {
            LOG.info("process element event time " + context.timestamp() + " current ingestTime " + context.timerService().currentProcessingTime());
            eventVsIngestionTimeLag.update(context.timerService().currentProcessingTime() - context.timestamp());
        }
    }).returns(EventDataMapping.class);

Can async subscriber example lose messages?

Starting off with Pub/Sub. When reading the Google Cloud documentation, I ran into a snippet of code, and I think I see a flaw in the example.
This is the code I am talking about. It uses the asynchronous subscriber.
public class SubscriberExample {

    private static final String PROJECT_ID = ServiceOptions.getDefaultProjectId();
    private static final BlockingQueue<PubsubMessage> messages = new LinkedBlockingDeque<>();

    static class MessageReceiverExample implements MessageReceiver {
        @Override
        public void receiveMessage(PubsubMessage message, AckReplyConsumer consumer) {
            messages.offer(message);
            consumer.ack();
        }
    }

    public static void main(String... args) throws Exception {
        String subscriptionId = args[0];
        ProjectSubscriptionName subscriptionName = ProjectSubscriptionName.of(
                PROJECT_ID, subscriptionId);
        Subscriber subscriber = null;
        try {
            subscriber =
                    Subscriber.newBuilder(subscriptionName, new MessageReceiverExample()).build();
            subscriber.startAsync().awaitRunning();
            while (true) {
                PubsubMessage message = messages.take();
                processMessage(message);
            }
        } finally {
            if (subscriber != null) {
                subscriber.stopAsync();
            }
        }
    }
}
My question is: what if a bunch of messages have been acknowledged, the BlockingQueue is not empty, and the server crashes? Then I would lose some messages, right? (Acknowledged in Pub/Sub, but not actually processed.)
Wouldn't the best implementation be to acknowledge a message only after it has been processed, instead of acknowledging it, leaving it on a queue, and assuming it will be processed? I understand this decouples receiving messages from processing them, which can increase throughput, but it still risks losing messages, right?
Yes, one should not acknowledge a message until it has been fully processed. Otherwise, if it was acknowledged, it will not be redelivered in the event of a crash or restart, and so it may never be processed. I have filed an issue to update the example.
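For reference, a minimal sketch of an ack-after-processing receiver might look like this, where processMessage stands in for the application's own logic just as in the question's snippet:
static class MessageReceiverExample implements MessageReceiver {
    @Override
    public void receiveMessage(PubsubMessage message, AckReplyConsumer consumer) {
        try {
            processMessage(message); // do the real work first
            consumer.ack();          // acknowledge only after successful processing
        } catch (Exception e) {
            consumer.nack();         // signal failure so Pub/Sub redelivers the message
        }
    }
}
Note that this processes messages on the subscriber's executor threads, so the throughput trade-off mentioned in the question applies.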

Flink RichParallelSourceFunction - close() vs cancel()

I am implementing a RichParallelSourceFunction which reads files over SFTP. RichParallelSourceFunction inherits cancel() from SourceFunction and close() from RichFunction. As far as I understand it, both cancel() and close() are invoked before the source is torn down, so in both of them I have to add logic for stopping the endless loop which reads files.
When I set the parallelism of the source to 1 and run the Flink job from the IDE, the Flink runtime invokes close() right after open() and the whole job is stopped. I didn't expect this.
When I set the parallelism of the source to 1 and I run the Flink job in a cluster, the job runs as usual.
If I leave the parallelism of the source to the default (in my case 4), the job runs as usual.
Using Flink 1.7.
public class SftpSource<TYPE_OF_RECORD>
        extends RichParallelSourceFunction<TYPE_OF_RECORD>
{
    private final SftpConnection mConnection;
    private boolean mSourceIsRunning;

    @Override
    public void open(Configuration parameters) throws Exception
    {
        mConnection.open();
    }

    @Override
    public void close()
    {
        mSourceIsRunning = false;
    }

    @Override
    public void run(SourceContext<TYPE_OF_RECORD> aContext)
    {
        while (mSourceIsRunning)
        {
            synchronized ( aContext.getCheckpointLock() )
            {
                // use mConnection
                // aContext.collect() ...
            }
            try
            {
                Thread.sleep(1000);
            }
            catch (InterruptedException ie)
            {
                mLogger.warn("Thread error: {}", ie.getMessage());
            }
        }
        mConnection.close();
    }

    @Override
    public void cancel()
    {
        mSourceIsRunning = false;
    }
}
So I have workarounds and the question is more about the theory. Why is close() invoked if parallelism is 1 and the job is run from the IDE (i.e. from the command line)?
Also, do close() and cancel() do the same in a RichParallelSourceFunction?
Why is close() invoked if parallelism is 1 and the job is run from the IDE?
close() is called after the last call to the main working methods (e.g. map or join). This method can be used for clean-up work.
It is called regardless of the parallelism you configure.
Also, do close() and cancel() do the same in a RichParallelSourceFunction?
They aren't the same thing; take a look at how each is described.
Cancels the source. Most sources will have a while loop inside the run(SourceContext) method. The implementation needs to ensure that the source will break out of that loop after this method is called.
https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/functions/source/SourceFunction.html#cancel--
The following link may help you to understand the task lifecycle:
https://ci.apache.org/projects/flink/flink-docs-stable/internals/task_lifecycle.html#operator-lifecycle-in-a-nutshell
I think the Javadocs are more than self-explanatory:
Gracefully Stopping Functions
Functions may additionally implement the {@link org.apache.flink.api.common.functions.StoppableFunction} interface. "Stopping" a function, in contrast to "canceling", means a graceful exit that leaves the state and the emitted elements in a consistent state.
-- SourceFunction.cancel
Cancels the source. Most sources will have a while loop inside the run(SourceContext) method. The implementation needs to ensure that the source will break out of that loop after this method is called.
A typical pattern is to have a "volatile boolean isRunning" flag that is set to false in this method. That flag is checked in the loop condition.
When a source is canceled, the executing thread will also be interrupted (via Thread.interrupt()). The interruption happens strictly after this method has been called, so any interruption handler can rely on the fact that this method has completed. It is good practice to make any flags altered by this method "volatile", in order to guarantee the visibility of the effects of this method to any interruption handler.
-- SourceContext.close
This method is called by the system to shut down the context.
Note: you cancel a SourceFunction, but you close a SourceContext.
I found a bug in my code. Here is the fix:
@Override
public void open(Configuration parameters) throws Exception
{
    mConnection.open();
    mSourceIsRunning = true;
}
Now close() is not invoked until I decide to stop the workflow, in which case cancel() is invoked first and then close(). I am still wondering how the parallelism affected the behaviour.
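For completeness, pulling the advice from the quoted Javadoc together, a sketch of the usual pattern (a volatile flag initialized up front, cleared in cancel(), with resource clean-up in close()) could look like the following. It is only an illustration based on the question's SftpSource, not the actual production code:
public class SftpSource<T> extends RichParallelSourceFunction<T>
{
    // volatile so the change made in cancel() is visible to the thread executing run()
    private volatile boolean isRunning = true;
    private final SftpConnection connection; // SftpConnection is the question's own class

    public SftpSource(SftpConnection connection)
    {
        this.connection = connection;
    }

    @Override
    public void open(Configuration parameters) throws Exception
    {
        connection.open();
    }

    @Override
    public void run(SourceContext<T> ctx) throws Exception
    {
        while (isRunning)
        {
            synchronized (ctx.getCheckpointLock())
            {
                // read from the connection and ctx.collect(...) the records
            }
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel()
    {
        isRunning = false; // breaks the loop; the runtime may also interrupt the thread
    }

    @Override
    public void close() throws Exception
    {
        connection.close(); // clean-up work belongs here
    }
}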

System.AbortJob not working?

I'm trying to figure out how to get System.abortJob() to actually work. My assumption (which could be wrong) is that when I pass the current jobId to System.abortJob(), the job will stop right there and abort. Here is my test, which isn't working, as I can see the System.debug() output showing up in my logs.
Executing from execute anonymous:
queueableTest tst = new queueableTest();
System.enqueueJob(tst);
Queueable class:
public class queueableTest implements Queueable {
    public void execute(QueueableContext context) {
        ID jobID = context.getJobId();
        System.abortJob(jobID);
        shouldntExecute();
    }

    public static void shouldntExecute() {
        System.debug('Why is this executing?');
    }
}
Any help/feedback greatly appreciated!
The System.abortJob() call only takes effect after its execution context has completed. Since you are calling abortJob from the same context that you want to abort, by the time it takes effect your code has already finished executing, which makes the System.abortJob() call irrelevant.
If you want to abort the current job, you need to use return; or System.assert(false, 'Aborting');. In the first case your job will terminate with a status of 'Completed', and in the second case with a status of 'Failed'. Throwing an exception would have the same result as the assert.

Apache Camel Main-Class and its methods start, stop, run, suspend and resume

I am writing my first Camel application. It is a standalone application with a main method. As a starting point I used the Maven Camel Java archetype. It provides a simple main method that calls main.run().
Now I have refactored it a little and pulled the main.run() call out into a new class (and method) that will be my main control point for all the Camel stuff.
Now I want to create the "opposite" of run(). At the moment I want to implement tests for single routes that start the context (run()), then wait (I am unsure how to wait until a route is finished), and then stop the context.
But now I have discovered many methods in the Main class that could start and stop things. The Javadoc didn't help, and the fact that some methods are inherited doesn't make it easier ;-). So could someone please tell me the exact meaning (or use case) of:
Main.run()
Main.start()
Main.stop()
Main.suspend()
Main.resume()
Thanks in advance.
See this page about the lifecycle of the various Camel services
http://camel.apache.org/lifecycle
And for waiting until a route is finished, you can check the inflight repository to see whether there are any in-flight exchanges left for that route.
http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/spi/InflightRepository.html
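As a rough illustration of that idea, a test could poll the repository until it drains. This is only a sketch, assuming you already have a reference to the CamelContext (the helper name is made up):
// Hypothetical helper: blocks until no exchanges are in flight in the whole context.
public static void waitUntilNoInflight(CamelContext context) throws InterruptedException {
    while (context.getInflightRepository().size() > 0) {
        Thread.sleep(100);
    }
}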
We must separate the methods into two groups.
The first group is the one described on the lifecycle page: http://camel.apache.org/lifecycle
The second group is composed of run() and shutdown().
run() runs indefinitely and can be stopped by invoking shutdown(); the latter must be invoked from a different thread, set up before run() is called.
Example:
import org.apache.camel.main.Main;

public class ShutdownTest {
    public static void main(String[] args) throws Exception {
        Main camel = new Main();
        camel.addRouteBuilder(new MyRouteBuilder());
        // In this case the thread will wait a certain time and then invoke shutdown.
        MyRunnable r = new MyRunnable(5000, camel);
        r.execute();
        camel.run();
    }
}
Simple Runnable class
public class MyRunnable implements Runnable {

    long waitingFor = -1;
    Main camel;

    public MyRunnable(long waitingFor, Main camel) {
        this.waitingFor = waitingFor;
        this.camel = camel;
    }

    public void execute() {
        Thread thread = new Thread(this);
        thread.start();
    }

    @Override
    public void run() {
        try {
            synchronized (this) {
                this.wait(waitingFor);
            }
        } catch (InterruptedException e) {
        }
        try {
            System.out.println("camel.shutdown()");
            camel.shutdown();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
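For the lifecycle methods from the first group, a rough sketch of how they are typically used in a test might look like the following. Exact behaviour depends on your Camel version, so treat it as an outline rather than a definitive recipe; MyRouteBuilder is the one from your archetype project:
import org.apache.camel.main.Main;

public class LifecycleSketch {
    public static void main(String[] args) throws Exception {
        Main camel = new Main();
        camel.addRouteBuilder(new MyRouteBuilder());

        camel.start();    // start Camel and return (non-blocking), unlike run()
        // ... exercise the routes here, e.g. send test messages ...
        camel.suspend();  // pause the routes but keep their state
        camel.resume();   // continue where the routes left off
        camel.stop();     // shut everything down
    }
}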
