How to recover Flink SQL jobs from a checkpoint? - apache-flink

I am checking whether a Flink SQL table with the Kafka connector can run in EXACTLY_ONCE mode. My approach is to create a table, set a reasonable checkpoint interval, apply a simple tumble function on an event_time field, and finally restart my program.
Here is my detailed process:
1: Create a Kafka table
CREATE TABLE IF NOT EXISTS LOG_TABLE (
    id STRING,
    ...
    ...
    event_timestamp TIMESTAMP(3),
    WATERMARK FOR event_timestamp AS ...
)
2: Start my Flink job with the following configuration
StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());
environment.getCheckpointConfig().setCheckpointInterval(30000L);
environment.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
environment.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
environment.getCheckpointConfig().setCheckpointStorage(new FileSystemCheckpointStorage("file:///tmp/checkpoint/"));
environment.setStateBackend(new HashMapStateBackend());
environment.setParallelism(1);
EnvironmentSettings settings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
TableEnvironment tableEnvironment = StreamTableEnvironment.create(environment, settings);
tableEnvironment.getConfig().getConfiguration().setBoolean("table.exec.emit.early-fire.enabled", true);
tableEnvironment.getConfig().getConfiguration().setString("table.exec.emit.early-fire.delay", "1s");
3: Execute my SQL
SELECT TUMBLE_END(event_timestamp, INTERVAL '5' MINUTE),
       COUNT(1)
FROM LOG_TABLE
GROUP BY TUMBLE(event_timestamp, INTERVAL '5' MINUTE)
As shown, the tumble window interval is 5 minutes and the checkpoint interval is 30 seconds, so each tumble window spans 6 checkpoints.
In this case the window state was lost:
2:00:00 pm: Launch the job and send 100 messages. (Job ID is bd208afa6599864831f008d429a527bb; chk-1 through chk-3 triggered successfully and checkpoint files were created in the checkpoint directory.)
2:01:40 pm: Shut down the job and modify the CheckpointStorage directory to /tmp/checkpoint/bd208afa6599864831f008d429a527bb/chk-3.
2:02:00 pm: Restart the job and send another 100 messages.
All messages were sent within 2 minutes, so after restarting from the checkpoint the job output should have been 200, but the result was 100: the job lost the first run's state.
Is there any mistake in my process? Please help me check, thanks.

Restarting a Flink job while preserving exactly-once guarantees requires launching the follow-on job in a special way so that the new job begins by restoring the state from the previous job. (Modifying the checkpoint storage directory, as you've done in step 2, isn't helpful.)
If you are using the SQL Client to launch the job, see Start a SQL Job from a savepoint, which involves doing something like this:
SET 'execution.savepoint.path' = '/tmp/flink-savepoints/...';
before launching the query that needs the state to be restored.
If you are using the Table API, then the details depend on how you are launching the job, but you can use the command line with something like this:
$ ./bin/flink run \
--detached \
--fromSavepoint /tmp/flink-savepoints/savepoint-cca7bc-bb1e257f0dab \
./examples/streaming/StateMachineExample.jar
or you might be using the REST API, in which case you will POST to /jars/:jarid/run with a savepointPath configured.
Note that you can use a retained checkpoint rather than a savepoint for restarting or rescaling your jobs. Also note that if you change the query in ways that render the old state incompatible with the new query, then none of this is going to work. See FLIP-190 for more on that topic.
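Applied to your case, that means leaving the checkpoint storage directory unchanged and instead pointing the restarted job at the retained checkpoint. A sketch, using the job ID and chk-3 directory from your own run, in the SQL Client before re-submitting the query:

```sql
SET 'execution.savepoint.path' = '/tmp/checkpoint/bd208afa6599864831f008d429a527bb/chk-3';
```

The same path also works with the `-s` / `--fromSavepoint` option of `flink run`, since retained checkpoints can be restored like savepoints.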
The Flink Operations Playground is a tutorial that covers this and related topics in more detail.

Related

How to trigger the last proctime window when source kafka has little data being fed from time to time?

I intend to implement a batch-sync Flink job based on a UDAF (batch collecting and firing) + a proctime window + checkpoints. It works fine when the source Kafka topic has messages coming in from time to time, which triggers Flink to advance the watermark.
But when the source Kafka topic receives little data (say, approx. 1 message per hour from 14:00 to 22:00), how can the proctime window (INTERVAL '1' MINUTE) still be triggered, so that the latest data in the current window is synchronized to the sink in a timely manner instead of waiting for an indeterminate time?
Could anyone give some suggestions? Thanks!
--update
I've found this explanation in the Flink documentation:
Does it mean that when applying a proctime window, watermarks are emitted by design even when no data is being fed from the source topic?

Flink checkpoint is getting aborted

I am running an Apache Flink mini cluster in IntelliJ.
I am trying to set up a stream join where one stream comes from a Kinesis source and the other from JDBC.
When I create a datastream from a table source like the following:
// Table with two fields (String name, Integer age)
Table table = ...
// convert the Table into an append DataStream of Row by specifying the class
DataStream<Row> dsRow = tableEnv.toAppendStream(table, Row.class);
I am getting the following info message in the stack trace:
INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint triggering task
Source ... of job bcf73c5d7a0312d57c2ca36d338d4569 is not in state RUNNING but FINISHED instead.
Aborting checkpoint.
Flink checkpoints cannot happen if any of the job's tasks has run to completion. Perhaps your JDBC source has finished, and this is preventing any further checkpointing?
You can also check your parallelism settings:
if your program's parallelism is greater than the source's parallelism, some subtasks will finish because they receive no data, and that will abort checkpoints.
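If the JDBC side really is a bounded source that finishes, newer Flink versions can keep checkpointing after some tasks complete. A hedged sketch, assuming Flink 1.14 or later (where this FLIP-147 option was introduced; it became the default in 1.15):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration conf = new Configuration();
// Allow checkpoints to continue after bounded tasks (e.g. a finished JDBC source) complete.
conf.setString("execution.checkpointing.checkpoints-after-tasks-finish.enabled", "true");
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
```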

Snowflake - Task not running

I have created a simple task with the below script and for some reason it never ran.
CREATE OR REPLACE TASK dbo.tab_update
WAREHOUSE = COMPUTE_WH
SCHEDULE = 'USING CRON * * * * * UTC'
AS CALL dbo.my_procedure();
I am using a Snowflake trial Enterprise version.
Did you RESUME? From the docs -- "After creating a task, you must execute ALTER TASK … RESUME before the task will run"
A bit of clarification: both steps, while possibly annoying, are needed.
Tasks can consume warehouse time (credits) repeatedly (e.g. up to every minute), so we wanted to make sure that the EXECUTE privilege was granted explicitly to a role.
Tasks can have dependencies, and task trees (eventually DAGs) shouldn't start executing as soon as one or more tasks are created. RESUME provides an explicit sync point when a data engineer can tell us that the task tree is ready for validation, and execution can start at the next interval.
Dinesh Kulkarni
(PM, Snowflake)
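Putting the two steps together, something like the following (a sketch: the task name is from the question, while my_role is a placeholder for whichever role owns the task; the grant must be run as ACCOUNTADMIN or a role with the privilege to grant it):

```sql
-- Grant the ability to run tasks to the owning role:
GRANT EXECUTE TASK ON ACCOUNT TO ROLE my_role;

-- Then resume the task so the scheduler starts picking it up:
ALTER TASK dbo.tab_update RESUME;
```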

CommitLog Recovery with Cassandra

I have noted following statement in the Cassandra documentation on commit log archive configuration:
https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configLogArchive.html
"Restore stops when the first client-supplied timestamp is greater than the restore point timestamp. Because the order in which the database receives mutations does not strictly follow the timestamp order, this can leave some mutations unrecovered."
This statement made us concerned about using point-in-time recovery based on Cassandra commit logs, since it indicates that a point-in-time recovery will not recover all mutations with a timestamp lower than the indicated restore-point timestamp if there are mutations out of timestamp order (which we will have).
I tried to verify this behavior via some experiments but have not been able to reproduce this behavior.
I did 2 experiments:
Simple row inserts
Set restore_point_in_time to 1 hour ahead in time.
insert 10 rows (using default current timestamp)
insert a row using timestamp <2 hours ahead in time>
insert 10 rows (using default current timestamp)
Now I killed my Cassandra instance, making sure it was terminated without having a chance to flush to SSTables.
During startup I could see from the Cassandra logs that it was doing commit log replay.
After replay I queried my table and could see that 20 rows had been recovered, but the one with the timestamp ahead of time was not inserted. Based on the documentation, though, I would have expected only the first 10 rows to be recovered. I verified in the Cassandra log that commit log replay had been done.
Larger CommitLog split experiment
I wanted to see if the documented feature then was working over a commitlog split/rollover.
So I set commitlog_segment_size_in_mb to 1 MB to cause the commitlog to rollover more frequently instead of the 32MB default.
I then ran a script to mass insert rows to force the commit log to split.
The result: I inserted 12000 records, then a record with a timestamp ahead of my restore_point_in_time, then 8000 records afterwards.
At about 13200 rows my commitlog rolled over to a new file.
I then killed my Cassandra instance again and restarted it. Again I could see in the log that commit log replay was being done, and after replay all rows except the single row with a timestamp ahead of restore_point_in_time were recovered.
Notes
I did similar experiments using the commitlog_sync batch option, and to make sure my rows had not been flushed to SSTables I also tried restoring a snapshot with empty tables before starting Cassandra, forcing it to perform commit log replay. In all cases I got the same results.
I guess my question is: is the statement in the documentation still valid? Or am I missing something in my experiments?
Any help would be greatly appreciated. I need an answer in order to conclude on the backup/recovery mechanism we want to implement in a larger-scale Cassandra cluster setup.
All experiments were done using Cassandra 3.11 (single-node setup) in a Docker container (the official cassandra Docker image). I ran the experiments on the image from scratch, so no configs were changed other than what I included in the description here.
I think it will be relatively hard to reproduce, as you'll need to make sure that some mutations arrive later than others; this mostly happens when some clients have unsynchronized clocks, or when nodes are overloaded and hints are replayed some time later, etc.
But this parameter may not be required at all: if you look into CommitLogArchiver.java, you can see that if the parameter is not specified, it is set to Long.MAX_VALUE, meaning there is no upper bound and all commit logs will be replayed; Cassandra then handles it the standard way ("the latest timestamp wins").
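For reference, point-in-time restore is driven by conf/commitlog_archiving.properties. A minimal sketch (the paths and timestamp are illustrative; note the unusual yyyy:MM:dd HH:mm:ss timestamp format):

```properties
restore_command=cp -f %from %to
restore_directories=/var/lib/cassandra/commitlog_restore
# Mutations with a client timestamp after this point are skipped during replay.
# Format is yyyy:MM:dd HH:mm:ss
restore_point_in_time=2020:04:30 20:43:12
```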

Change Data Capture (CDC) cleanup job only removes a few records at a time

I'm a beginner with SQL Server. For a project I need CDC to be turned on. I copy the CDC data to another (archive) database, and after that the CDC tables can be cleaned immediately, so the retention time doesn't need to be high; I just set it to 1 minute.
When the cleanup job runs (after the retention time has been fulfilled), it appears to delete only a few records (the oldest ones). Why didn't it delete everything? Sometimes it doesn't delete anything at all. After running the job a few times, the other records get deleted. I find this strange because the retention time has long passed.
I set the retention time to 1 minute (I actually wanted 0 but that was not possible) and didn't change the threshold (= 5000). I disabled the schedule since I want the cleanup job to run immediately after the CDC records are copied to my archive database, not at a particular time.
My reasoning was that, for example, there will be updates in the afternoon. The task that copies CDC records to the archive database runs at 2:00 AM, and after this task the cleanup job is called. Because of the minimal retention time, all the CDC records should then be removed by the cleanup job. The retention time has passed, after all?
I also tried setting up a schedule in the job again, the way CDC is meant to be used in general. After the time had passed I checked the CDC table, and it turned out it also deleted only the oldest records. So what am I doing wrong?
I made a workaround: a new job whose task is to delete all records in the CDC tables (and I disabled the entire default CDC cleanup job). This works better, as it removes everything, but it bothers me because I want to work with the original cleanup job, and I think it should be able to work the way I want it to.
Thanks,
Kim
Rather than worrying about what's in the table, I'd use the helper functions that are created for each capture instance, specifically cdc.fn_cdc_get_all_changes_ and cdc.fn_cdc_get_net_changes_. A typical workflow I've used with these goes something like the following (do this for all of the capture instances). First, you'll need a table to keep processing status. I use something like:
create table dbo.ProcessingStatus (
CaptureInstance sysname,
LSN numeric(25,0),
IsProcessed bit
)
create unique index [UQ_ProcessingStatus]
on dbo.ProcessingStatus (CaptureInstance)
where IsProcessed = 0
Get the current max log sequence number (LSN) using fn_cdc_get_max_lsn.
Get the last processed LSN and increment it using fn_cdc_increment_lsn. If you don't have one (i.e. this is the first time you've processed), use fn_cdc_get_min_lsn for this instance and use that (but don't increment it!). Record whatever LSN you're using in the table, with IsProcessed = 0.
Select from whichever of the cdc.fn_cdc_get… functions makes sense for your scenario and process the results however you're going to process them.
Update IsProcessed = 1 for this run.
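The steps above can be sketched in T-SQL like this. The capture instance name dbo_MyTable is a placeholder, and for simplicity the sketch assumes ProcessingStatus.LSN is declared binary(10) (rather than the numeric(25,0) shown above), since that is the type the CDC functions take and return:

```sql
DECLARE @from_lsn binary(10), @to_lsn binary(10), @last binary(10);

-- 1. Current max log sequence number.
SET @to_lsn = sys.fn_cdc_get_max_lsn();

-- 2. Last processed LSN for this capture instance, incremented;
--    fall back to the instance's min LSN on the first run (not incremented).
SELECT @last = LSN FROM dbo.ProcessingStatus
 WHERE CaptureInstance = 'dbo_MyTable' AND IsProcessed = 1;
SET @from_lsn = COALESCE(sys.fn_cdc_increment_lsn(@last),
                         sys.fn_cdc_get_min_lsn('dbo_MyTable'));

-- Record the interval being processed.
INSERT dbo.ProcessingStatus (CaptureInstance, LSN, IsProcessed)
VALUES ('dbo_MyTable', @to_lsn, 0);

-- 3. Pull and process the changes for the interval.
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all');

-- 4. Mark the run as processed.
UPDATE dbo.ProcessingStatus SET IsProcessed = 1
 WHERE CaptureInstance = 'dbo_MyTable' AND IsProcessed = 0;
```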
As for monitoring your original issue, just make sure that the data in the capture table is generally within the retention period. That is, if you set it to 2 days, I wouldn't even think about it being a problem until it got to be over 4 days (assuming that your call to the cleanup job is scheduled at something like every hour). And when you process with the above scheme, you don't need to worry about "too much" data being there; you're always processing a specific interval rather than "everything".
