Flink checkpoint is getting aborted - apache-flink

I am running an Apache Flink mini cluster in IntelliJ.
I am trying to set up a stream join where one stream comes from a Kinesis source and the other from JDBC.
When I create a DataStream from the table source like the following:
// Table with two fields (String name, Integer age)
Table table = ...
// convert the Table into an append DataStream of Row by specifying the class
DataStream<Row> dsRow = tableEnv.toAppendStream(table, Row.class);
I am getting the following INFO message in the stack trace:
INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint triggering task Source ... of job bcf73c5d7a0312d57c2ca36d338d4569 is not in state RUNNING but FINISHED instead. Aborting checkpoint.

Flink checkpoints cannot happen if any of the job's tasks have run to completion. Perhaps your JDBC source has finished, and this is preventing any further checkpointing?
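Newer Flink versions (1.14 and later) also have a setting that lets checkpointing continue after some tasks have finished; it is enabled by default in later releases. If upgrading is an option, here is a minimal sketch reusing the mini-cluster setup from the question:
Configuration conf = new Configuration();
// allow the CheckpointCoordinator to keep checkpointing once the bounded JDBC source is FINISHED
conf.setBoolean("execution.checkpointing.checkpoints-after-tasks-finish.enabled", true);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);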

Check your parallelism settings:
if your program's parallelism is greater than the source's parallelism, some subtasks will finish immediately because they receive no data, and any finished task aborts checkpointing.
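For example (a sketch, not specific to your job; MyJoinFunction is a placeholder), you can either pin the whole job to the source parallelism or raise only the operators that need more slots:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// job-wide default: match what the bounded JDBC/table source can actually use
env.setParallelism(1);
// or widen a single downstream operator and leave the source at 1, e.g.:
// dsRow.keyBy(row -> row.getField(0)).process(new MyJoinFunction()).setParallelism(4);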

Related

Flink JDBC Sink conditional multi-table insert

I have to implement a use case where I insert data (the source is a Kafka topic) into multiple Postgres tables (around 10 tables) in a transactional way, i.e. either all table inserts succeed or, if one insert fails, all inserts for that particular record are rolled back. The failed records should also be captured and written to another Kafka topic.
Based on my understanding of the JDBC sink implementation, we can only provide one prepared statement per sink. Also, according to the invoke method signature of the GenericJdbcSinkFunction class:
public void invoke(T value, SinkFunction.Context context) throws IOException
It only throws an IOException. Is it possible to catch an SQLException for a failed insert and then write that record to a separate Kafka topic? If so, what happens to the rest of the records in that batch, since if one record's insert fails I believe the whole batch fails?
Is it a good idea to use Flink for such a use case?
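As you note, the built-in JDBC sink is built around a single statement and batch-level failures, so per-record, multi-table atomicity usually means writing your own sink. As a rough sketch only (the connection URL and the table/column layout are placeholders, and Row is used as the record type just for illustration), a custom RichSinkFunction can own the connection, commit once per record across all tables, and catch the SQLException itself:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.types.Row;

public class MultiTableJdbcSink extends RichSinkFunction<Row> {

    private transient Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "pwd");
        connection.setAutoCommit(false); // commit manually, once per record, across all tables
    }

    @Override
    public void invoke(Row value, Context context) throws Exception {
        try (PreparedStatement a = connection.prepareStatement("INSERT INTO table_a (id, payload) VALUES (?, ?)");
             PreparedStatement b = connection.prepareStatement("INSERT INTO table_b (id, payload) VALUES (?, ?)")) {
            a.setObject(1, value.getField(0));
            a.setObject(2, value.getField(1));
            b.setObject(1, value.getField(0));
            b.setObject(2, value.getField(1));
            a.executeUpdate();
            b.executeUpdate();
            connection.commit();   // all table inserts succeed together for this record
        } catch (SQLException e) {
            connection.rollback(); // undo the partial inserts for this record
            // hand the failed record to a dead-letter producer created in open(), e.g. a plain KafkaProducer
        }
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}
This trades the batching of GenericJdbcSinkFunction for a transaction per record, so throughput will be lower, and it still does not give you end-to-end exactly-once on its own.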

How to trigger the last proctime window when the source Kafka topic only receives a little data from time to time?

I intend to implement a batch-sync Flink job based on a UDAF (batch collecting and firing) + a proctime window + checkpoints. It works fine when the source Kafka topic has messages coming in regularly, which triggers Flink to emit watermarks.
But when the source Kafka topic receives very little data (say, approx. 1 message per hour between 14:00 and 22:00), how can I make the proctime window (INTERVAL 1 MINUTE) fire as well, so that the latest data in the current window is still synchronized to the sink promptly instead of waiting an uncertain amount of time?
Could anyone give some suggestions? Thanks!
--update
I've found this explanation in the Flink documentation:
Does it mean that when using a proctime window, watermarks are emitted by design even when no data is being fed from the source topic?

PyFlink: how to set parallelism when using SQL and Table API?

I have a processing topology using PyFlink and SQL where there is data skew: I'm splitting a stream of heterogeneous data into separate streams based on the type of data in it, and some of these substreams have far more events than others, which is causing issues when checkpointing (checkpoints are timing out). I'd like to increase the parallelism for these problematic streams, but I'm not sure how to do that and target just those elements. Do I need to use the DataStream API here? What would that look like?
I have a table defined and I duplicate a stream from that table, then filter so that my substream has only the events I'm interested in:
events_table = table_env.from_path(MY_SOURCE_TABLE)
filtered_table = events_table.filter(
    col("event_type") == "event_of_interest"
)
table_env.create_temporary_view(MY_FILTERED_VIEW, filtered_table)
# now execute SQL on MY_FILTERED_VIEW
table_env.execute_sql(...)
The default parallelism of the overall table env is 1. Is there a way to increase the parallelism for just this stream?

How to recover Flink SQL jobs from a checkpoint?

I am checking whether a Flink SQL table with the Kafka connector can run in EXACTLY_ONCE mode. My approach is to create a table, set a reasonable checkpoint interval, use a simple tumble function on an event_time field, and finally restart my program.
Here are my detailed steps:
1: Create a Kafka table
CREATE TABLE IF NOT EXISTS LOG_TABLE(
id String,
...
...
event_timestamp timestamp(3), watermark for event_timestamp as ....
)
2: Start my Flink job with the following config
StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
environment.getCheckpointConfig().setCheckpointInterval(30000L);
environment.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
environment.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
environment.getCheckpointConfig().setCheckpointStorage(new FileSystemCheckpointStorage("file:///tmp/checkpoint/"));
environment.setStateBackend(new HashMapStateBackend());
environment.setParallelism(1);
EnvironmentSettings settings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
TableEnvironment tableEnvironment = StreamTableEnvironment.create(environment, settings);
tableEnvironment.getConfig().getConfiguration().setBoolean("table.exec.emit.early-fire.enabled", true);
tableEnvironment.getConfig().getConfiguration().setString("table.exec.emit.early-fire.delay", "1s");
3: Execute my SQL
select tumble_end(event_timestamp, interval '5' minute),
count(1)
from LOG_TABLE
group by tumble(event_timestamp, interval '5' minute)
As you can see, the tumble window interval is 5 minutes and the checkpoint interval is 30 seconds, so 6 checkpoints are triggered during every tumble window.
In this case the window state was lost:
2:00:00 pm: Launched the job and sent 100 messages. (The job id is bd208afa6599864831f008d429a527bb; chk-1 to chk-3 triggered successfully and checkpoint files were created in the checkpoint dir.)
2:01:40 pm: Shut down the job and modified the CheckpointStorage directory to /tmp/checkpoint/bd208afa6599864831f008d429a527bb/chk-3.
2:02:00 pm: Restarted the job and sent another 100 messages.
All the messages were sent within 2 minutes, so after restarting from the checkpoint the job output should have been 200, but the result was 100: the second run lost the first job's state.
Is there any mistake in my process? Please help me check, thanks.
Restarting a Flink job while preserving exactly-once guarantees requires launching the follow-on job in a special way so that the new job begins by restoring the state from the previous job. (Modifying the checkpoint storage directory, as you've done in step 2, isn't helpful.)
If you are using the SQL Client to launch the job, see Start a SQL Job from a savepoint, which involves doing something like this
SET 'execution.savepoint.path' = '/tmp/flink-savepoints/...';
before launching the query that needs the state to be restored.
If you are using the Table API, then the details depend on how you are launching the job, but you can use the command line with something like this
$ ./bin/flink run \
--detached \
--fromSavepoint /tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab \
./examples/streaming/StateMachineExample.jar
or you might be using the REST API, in which case you will POST to /jars/:jarid/run with a savepointPath configured.
Note that you can use a retained checkpoint rather than a savepoint for restarting or rescaling your jobs. Also note that if you change the query in ways that render the old state incompatible with the new query, then none of this is going to work. See FLIP-190 for more on that topic.
The Flink Operations Playground is a tutorial that covers this and related topics in more detail.
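Since you are starting the job from code with a local environment rather than via the CLI or SQL Client, you can also try passing the restore path through the configuration you hand to the environment. This is only a sketch, assuming the local executor honours execution.savepoint.path, and it reuses the chk-3 directory from your timeline:
Configuration conf = new Configuration();
// point the new run at the retained checkpoint of the previous run
conf.setString("execution.savepoint.path", "file:///tmp/checkpoint/bd208afa6599864831f008d429a527bb/chk-3");
StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);
// ...then the same checkpoint and table environment setup as in step 2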

Flink InvalidProgramException: Job was submitted in detached mode. Results of job execution, such as accumulators, runtime, etc. are not available

I have a Flink job running on a Kinesis Data Analytics application, which uses Flink's DataSet API to read data into two DataSet objects. Since I need the number of tuples in each DataSet, I call the count() method on each one, but I keep seeing this error when I run it through the AWS Console:
org.apache.flink.api.common.InvalidProgramException: The main method caused an error: Job was submitted in detached mode. Results of job execution, such as accumulators, runtime, etc. are not available. Please make sure your program doesn't call an eager execution function [collect, print, printToErr, count].
For context, this is roughly the code that is causing the exception:
DataSet<String> dataset = executionEnvironment.readTextFile(file);
log.info("Number of records: " + dataset.count());
Is there any way to change the execution mode from detached mode to another mode that would allow calling count() and the other accumulator functions?
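Detached submission (which is how Kinesis Data Analytics runs the main method) doesn't return job results to the client, which is why the eager count() fails. A workaround sometimes used instead (sketch only; the output path is a placeholder) is to compute the count inside the job graph and write it to a sink:
DataSet<String> dataset = executionEnvironment.readTextFile(file);
// count inside the job instead of eagerly on the client
dataset.map(line -> 1L).returns(Types.LONG)
       .reduce(Long::sum)
       .writeAsText("s3://my-bucket/record-count"); // placeholder path
executionEnvironment.execute("count-records");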
