Millisecond behind latest jumps after Flink version upgrade - apache-flink

I noticed some very strange behavior after a recent version bump from Flink 1.14.4 to 1.15.2. My project consumes around 30K records per second from a sharded Kinesis stream, and for a version upgrade we follow the usual best practice: trigger a savepoint from the running job, start the new job from that savepoint, and then remove the old job. So far so good, and this procedure has been tested multiple times on 1.14.4 without any issue. Usually, after an upgrade, the job shows a few minutes of delay in milliseconds behind latest, but it catches up quickly (within 30 minutes). Our savepoint is around one hundred MB, and the job DAG becomes 90-100% busy with some backpressure right after the redeploy, but after 10-20 minutes it goes back to normal.
Then the strange thing happened: when I redeployed a running 1.14.4 job with the 1.15.2 upgrade, I could see that a savepoint was created and the new job was running, and all the metrics looked fine, except that milliseconds behind latest suddenly jumped to 10 hours, and it took days for the application to catch up with the latest records in the Kinesis stream. I don't understand why it jumps from 0 seconds to 10+ hours when we restart the new job. The only other change I introduced with the version bump was flipping failOnError from true to false, but I don't think that is the root cause.
One observation: when I redeploy an already-running 1.15.2 job with a changed parallelism, there is no big delay, so I assume the issue only happens when bumping the version from 1.14.4 to 1.15.2. I tried the version bump twice and saw the same 10+ hour jump in delay both times.
Any insights are welcome, thank you.
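For reference, here is a minimal sketch of what such a pipeline can look like on Flink 1.15, assuming the failOnError mentioned above refers to the failOnError option of the 1.15 KinesisStreamsSink; the stream names, region, and schema are placeholders rather than details taken from the question.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kinesis.sink.KinesisStreamsSink;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class KinesisPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: the consumer reports the per-shard millisBehindLatest metric.
        // When the job is restored from a savepoint it resumes from the stored shard
        // sequence numbers, so the metric reflects how far behind those offsets are;
        // the initial-position setting below only applies to a fresh start.
        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");
        consumerConfig.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

        FlinkKinesisConsumer<String> source =
                new FlinkKinesisConsumer<>("input-stream", new SimpleStringSchema(), consumerConfig);

        // Sink (new in 1.15): with failOnError=false, write failures are retried/handled
        // instead of immediately failing the job.
        Properties sinkConfig = new Properties();
        sinkConfig.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");

        KinesisStreamsSink<String> sink = KinesisStreamsSink.<String>builder()
                .setStreamName("output-stream")
                .setSerializationSchema(new SimpleStringSchema())
                .setPartitionKeyGenerator(element -> String.valueOf(element.hashCode()))
                .setKinesisClientProperties(sinkConfig)
                .setFailOnError(false)   // the flag mentioned in the question
                .build();

        env.addSource(source).sinkTo(sink);
        env.execute("kinesis-pipeline-sketch");
    }
}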

Looking through the Flink 1.15 changes related to the Kinesis consumer, nothing obvious stands out to me. I would recommend filing a Jira ticket with the Flink community for this issue. See https://issues.apache.org/jira/projects/FLINK/issues

Related

Savepoint in Apache Flink with Large State

I want to keep about 2 TB of state in Flink using the RocksDB state backend. I will use incremental checkpoints, which should reduce the checkpoint time dramatically.
But I sometimes have to change the code, e.g. re-scaling, bug fixes, adding a new filter/mapping, adding new sources/sinks, etc.
All of these can affect the job topology. I can bootstrap the state again when there are changes to the state, but at other times bootstrapping the state is difficult because it means wasted time for me.
In these cases I have to take a savepoint to restart my job. I also take savepoints periodically (e.g. every 15 minutes) while the job is running, so that the job can be restarted from the latest savepoint when it fails. But taking a savepoint takes too long because of the large state. MTTR (mean time to recovery) is very important for me. How can I improve savepoint performance?
You can use retained checkpoints for redeployments that don't change the topology, don't require a state migration, and don't upgrade the Flink version (e.g., rescaling, or simple code changes that don't affect state); otherwise you should use a savepoint. And with large state, taking a savepoint can take quite a while (and I don't have any ideas for how to speed it up).
Rather than trying to improve savepoint performance, you might consider whether some sort of blue/green deployment strategy could work for you. For example, see Zero-downtime upgrades of Flink applications.
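To make the knobs discussed above concrete, here is a minimal sketch of enabling the RocksDB backend with incremental checkpoints on a recent Flink version; the checkpoint interval and storage path are placeholders, not recommendations.

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB state backend with incremental checkpoints enabled: each checkpoint
        // only uploads the SST files created since the previous one, which is what keeps
        // checkpoint times manageable with multi-TB state.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Checkpoint every 15 minutes; the storage path is a placeholder.
        env.enableCheckpointing(15 * 60 * 1000L);
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");

        // Trivial placeholder pipeline so the sketch is complete.
        env.fromElements("a", "b", "c").print();
        env.execute("incremental-checkpoint-sketch");
    }
}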

Difference between savepoint and checkpoint in Flink

I know there are similar questions on Stack Overflow, but after investigating several of them, I know that:
a savepoint is triggered manually, while a checkpoint is triggered automatically
they use different storage formats
But these are not the confusing points; I have no idea when to use one or when to use the other.
Consider the following two scenarios:
If I need to shut down or restart the whole application for some reason (e.g. a bug fix or an unexpected crash), will I have to use a savepoint to restore the whole application?
I thought checkpoints are only used internally in Flink for fault tolerance while the application is running, that is, the application itself keeps running but tasks or other components may fail, and Flink will use the checkpoint for state recovery?
There are also externalized checkpoints; I think they are functionally the same as savepoints, that is, an externalized checkpoint can also be used to recover a restarted application?
Does Flink use checkpoint for state recovery?
Basically you're right. As you said, checkpoints are usually used internally in Flink for fault tolerance and are more of a concept inside the framework. When your application fails, the program will try to restart from the latest checkpoint. That's how checkpoints work in Flink, without any manual intervention.
Should I use savepoint to restore the whole application for bug fix?
Yes. In these cases you don't want to restore from a checkpoint, because the latest checkpoint may have been taken several minutes ago. Instead, you'd like to snapshot the current state of the whole application and restart it from that savepoint, which may be the quickest way to restore the application without too much delay.
Externalized checkpoint.
It's still a checkpoint, but it is persisted externally based on your configuration. It can be used to restore the application, but the state is not as up to date, because there is an interval between checkpoints.
For more information, take a look at this blog article: https://data-artisans.com/blog/differences-between-savepoints-and-checkpoints-in-flink.
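For illustration, here is a minimal sketch of enabling externalized (retained) checkpoints so that the latest checkpoint can be used to recover a cancelled or restarted application; newer Flink versions expose the same setting under a setExternalizedCheckpointCleanup method, so check the documentation for your version.

import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExternalizedCheckpointSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every minute for automatic fault-tolerance recovery.
        env.enableCheckpointing(60_000L);

        // Retain the latest checkpoint even after the job is cancelled, so it can be
        // used much like a savepoint to restart the application later.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Trivial placeholder pipeline so the sketch is complete.
        env.fromElements(1, 2, 3).print();
        env.execute("externalized-checkpoint-sketch");
    }
}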

Is it normal to need to restart Solr/Apache regularly?

It seems like about once every 2 to 3 months or so, something about our Solr implementation breaks down. Most recently, the process we use to reindex our Solr cores broke. It's a console application that does just two things: 1) clear the indexes, 2) rebuild the indexes. That's it, and it does this by issuing HTTP requests to the server.
I was getting a 500 response when it tried to clear the first core. None of the other cores had a problem, though (except that they came after the first one, and it is a synchronous process, so nothing got reindexed). I spent a little time troubleshooting, but ultimately I just restarted Apache and it worked.
That seems to be the solution for everything. I wish I could remember previous issues I've run into, but something a little different seems to happen each time and it's always just much easier to reset apache than to spend the time to troubleshoot (plus it always happens in production and so spending hours troubleshooting isn't really a good option if I can fix it in seconds).
I wish these issues happened in staging or development where I could take time to investigate further, but it's always in production. So I'm starting to wonder if I should just create a task to reset Apache server every night.
When I reset it and it just suddenly works normally without me having to make a single change, I really have to wonder about the stability of Solr. Is this normal for others who use Solr?

How to do rolling upgrade with zero downtime

Is it possible to do a job version update with zero downtime?
Maybe with an HA configuration? I.e., replace the standby job with the updated one, then cancel the master, which causes the (updated) standby to become the master, and then upload a new updated job in place of the master we cancelled in the previous step, in order to maintain HA.
Is this scenario possible? Are there other scenarios that can achieve zero downtime on a job version update?
I don't think Flink HA mode is actually appropriate for zero-downtime job upgrades. HA mode ensures that a failing JobManager can be replaced without losing state information, but it isn't HA in the sense of zero unavailability: there is still a gap between the time the primary JobManager fails and the standby JobManager takes over (or, on systems like Kubernetes, while the lone JobManager fails a health check and is replaced).
For some types of jobs, zero-downtime upgrades are possible but not supported by Flink itself. For example, if your job outputs to an Elasticsearch index, you could bring up the upgraded job from a savepoint in parallel with the original, but writing to a new index, and when it has caught up, switch your clients (or an Elasticsearch index alias) to reference the new index.
Another technique I've considered but never tried would be to build into your applications a way to configure a flag that says when to start or stop emitting data. That way, you could update the configuration of the original job to drop (not forward to a sink) any windowed data starting at some timestamp in the near future, then run the upgraded job and configure it to emit its first window at that time.
Built-in support for zero-downtime "handoffs" is a feature that would be pretty nice to have in Flink for many use-cases.
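A rough sketch of that "emit flag" idea, assuming a hypothetical CutoverGate filter placed just in front of the sink; the record type and timestamp accessor are placeholders, and coordinating the cutover timestamp between the two jobs is left out.

import java.io.Serializable;
import java.util.function.ToLongFunction;

import org.apache.flink.api.common.functions.FilterFunction;

// Hypothetical cutover gate: the original job runs it with emitAfterCutover = false
// (drop results from the cutover time onward), the upgraded job runs it with
// emitAfterCutover = true (emit only from the cutover time onward), so exactly one
// job writes each window.
public class CutoverGate<T> implements FilterFunction<T> {

    // Extracts the event/window-end timestamp from a record; must be serializable.
    public interface TimestampAccessor<T> extends ToLongFunction<T>, Serializable {}

    private final long cutoverEpochMillis;
    private final boolean emitAfterCutover;
    private final TimestampAccessor<T> timestampAccessor;

    public CutoverGate(long cutoverEpochMillis,
                       boolean emitAfterCutover,
                       TimestampAccessor<T> timestampAccessor) {
        this.cutoverEpochMillis = cutoverEpochMillis;
        this.emitAfterCutover = emitAfterCutover;
        this.timestampAccessor = timestampAccessor;
    }

    @Override
    public boolean filter(T record) {
        boolean afterCutover = timestampAccessor.applyAsLong(record) >= cutoverEpochMillis;
        return emitAfterCutover ? afterCutover : !afterCutover;
    }
}

The original job would add a filter(new CutoverGate<>(cutoverMillis, false, ...)) step before its sink, and the upgraded job the same gate with true, so each window is emitted by exactly one of the two jobs.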

Apache Flink, job with a big graph - submission times out on cluster

We are trying to build a Flink job for price aggregation with quite complicated logic.
E.g., the previous version had a job graph like the one below.
After another development iteration, I added even more complexity to the job.
The new version runs fine from the IDE; however, deployment to the cluster fails with:
Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out.
If I reconfigure the job (reduce graph complexity) it gets deployed without any problem.
My questions are:
Are there any limitations on job graph size and complexity when submitting to a standalone cluster?
Is there any way to disable the graphical graph representation? (I suspect the problem is caused by the graph view, since locally my job works.)
Are there any debugging tools to understand what is happening during job submission and why it times out?
Thanks in advance.
The solution was to use the latest Flink version (1.5 at the time of writing).
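If upgrading is not an option, the client-side submission timeout itself can be raised. On the 1.4/1.5-era versions that throw JobClientActorSubmissionTimeoutException this was governed by the akka.client.timeout option (normally set in flink-conf.yaml); the sketch below shows one way it might be passed programmatically, with the host, port, and jar path as placeholders. Please verify the exact option name against the documentation for your Flink version.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SubmitWithLongerTimeout {
    public static void main(String[] args) throws Exception {
        // Assumption: raise the blocking client timeout used during job submission.
        // On old Flink versions this was the akka.client.timeout option (default 60 s).
        Configuration clientConfig = new Configuration();
        clientConfig.setString("akka.client.timeout", "600 s");

        // Host, port, and jar path are placeholders for a standalone cluster setup.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createRemoteEnvironment(
                "jobmanager-host", 6123, clientConfig, "/path/to/price-aggregation-job.jar");

        // Placeholder pipeline standing in for the real (large) job graph.
        env.fromElements(1, 2, 3).print();
        env.execute("submit-with-longer-timeout");
    }
}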
