Flink+Kafka High Memory Usage - apache-flink

I have a basic Flink job which reads from Kafka as strings, performs some string operations in a flatMap, and sends the messages back to Kafka as strings. No window or state. The message rate is about 25K/second.
The state backend is hashmap.
When I check the task manager in the Flink UI, I see that heap memory usage sometimes goes up to 10 GB.
When I watch it, it fluctuates between 3 GB and 10 GB.
I have no idea where this memory is used. A message is about 1 KB, so I receive about 25 MB of data per second and write it back to Kafka without any state or window. No keyBy, no window, no state. Nothing.
Any idea why the memory usage is so high? Any advice would be very helpful for analyzing the problem.
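For context, a minimal sketch of the kind of job described above, assuming a recent Flink version with the KafkaSource/KafkaSink connectors; the broker address, topic names, consumer group, and the flatMap logic are placeholders, not details taken from the question:
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class KafkaStringTransformJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Read plain strings from the input topic (all names are placeholders).
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("input-topic")
                .setGroupId("string-transform")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
        // Write the transformed strings back to Kafka.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                // Stateless string manipulation; the real logic is not given in the question.
                .flatMap((String value, Collector<String> out) -> out.collect(value.trim()))
                .returns(String.class)
                .sinkTo(sink);
        env.execute("kafka-string-transform");
    }
}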

Related

Flink to implement a job which should start processing events once its parent job has done bootstrapping

I have a use case to implement in which historic data processing needs to be done before my streaming job can start processing live events.
My streaming job will become part of an already running system, which means data is already present. This data first needs to be processed before my job starts processing the live streaming events.
So how should I design this? The ways I can think of are the following:
a) First process the historic data; only once that is done, start the streaming job.
b) Start the historic data processing & the streaming job simultaneously, but keep buffering the events until the historic data has been processed.
c) Make one job that has both capabilities: historic data processing + streaming live events.
Pros & cons of the above approaches:
Approach (a) is simple but needs manual intervention. Also, the historic data will take time to load, and once that is done and I start the streaming job, what should the Flink consumer property be for reading from the stream: earliest, latest, or timestamp based? The reason to think about this is that the moment the job starts, it will be a fresh consumer with no offset/consumer group id registered with the Kafka broker (in my case it is Oracle Streaming Service).
For approach (b), the buffer would need to be large enough to hold the events' state. Also, the window that holds the events needs to buffer up to timestamp 'x' for the first run only, while after that it should be 'y' (ideally much smaller than 'x', as the bootstrapping is already done). How can I make this possible?
Approach (c) sounds good, but the historic processing is only for the first run, and most importantly, after historic processing only the buffered events need to be processed. On subsequent runs no historic processing is required, so how would the streaming side know that it should just keep processing the events?
I'd appreciate any help/suggestions to implement & design my use case better.
The HybridSource is designed for this scenario. The current implementation has some limitations, but you may find it meets your needs.
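For illustration, a rough sketch of how a HybridSource can be wired up, assuming Flink 1.15+ with the file and Kafka connectors on the classpath; the S3 path, topic, and broker address are made-up placeholders:
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;

// Bounded source for the historic data (e.g. files in S3 or HDFS).
FileSource<String> historicSource = FileSource
        .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://bucket/historic/"))
        .build();

// Unbounded source for the live events.
KafkaSource<String> liveSource = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")
        .setTopics("events")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

// HybridSource reads the bounded source to completion, then switches to Kafka.
HybridSource<String> hybridSource = HybridSource.builder(historicSource)
        .addSource(liveSource)
        .build();

// Used like any other source:
// env.fromSource(hybridSource, WatermarkStrategy.noWatermarks(), "hybrid-source") ...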
You might go for the approach explained in the 2019 Flink Forward talk A Tale of Dual Sources.
From what I remember, their situation was slightly different, in that they had two sources for the same data, a historic store (S3) and a queue with fresh events (Kafka), yet the data content and processing was the same.
They attempted to write a custom source which read from both Kafka and S3 at the same time, but that failed due to some idiosyncrasies of Flink source initialization.
They also did something like approach b, but the buffered data would often become way too large to handle.
They ended up making one job that can read both sources, but only reads S3 at first, then terminates itself by throwing an exception, and upon being restarted by Flink, starts reading Kafka.
With this restarting trick, you can essentially get the advantages of a and c, without having to worry about needing any manual intervention for the switch.

commitOffsetsInFinalize() and checkpoints in Apache Beam

I am working on a Beam application that uses KafkaIO as an input
KafkaIO.<Long, GenericRecord>read()
.withBootstrapServers("bootstrapServers")
.withTopic("topicName")
.withConsumerConfigUpdates(confs)
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(Deserializer.class)
.commitOffsetsInFinalize()
.withoutMetadata();
I am trying to understand how exactly the commitOffsetsInFinalize() works.
How can the streaming job be finalized?
The last step in the pipeline is a custom DoFn that writes the messages to DynamoDb. Is there any way to manually call some finalize() method there, so that the offsets are committed after each successful execution of the DoFn?
Also, I am having a hard time understanding what the relationship is between checkpoints and finalization. If no checkpointing is enabled on the pipeline, will I still be able to finalize and get commitOffsetsInFinalize() to work?
P.S. The way the pipeline is right now, even with commitOffsetsInFinalize(), each message that is read is committed regardless of whether there is a failure downstream, hence causing data loss.
Thank you!
The finalize here is referring to the finalization of the checkpoint, in other words when the data has been durably committed into Beam's runtime state (such that worker failures/reassignments will be retried without having to read this message from Kafka again). This does not mean that the data in question has made it all the way through the rest of the pipeline.
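Since finalization is tied to checkpoint completion, offset commits configured via commitOffsetsInFinalize() generally only happen once a checkpoint completes, so on the Flink runner you typically need checkpointing enabled. A sketch of the pipeline-options setup under that assumption (the 60-second interval is arbitrary):
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(FlinkPipelineOptions.class);
options.setRunner(FlinkRunner.class);
// Ask the Flink runner to take a checkpoint every 60 seconds; offset commits
// requested via commitOffsetsInFinalize() happen when these checkpoints complete.
options.setCheckpointingInterval(60_000L);

Pipeline p = Pipeline.create(options);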

Kinesis Data Analytics Flink: Continually Increasing Checkpoint Size

I am running a Flink application using the AWS Kinesis Data Analytics (KDA) service. My KDA Flink application's last checkpoint size appears to be growing steadily over time. The sudden drops in checkpoint size visible in the attached graph correspond to when I pushed changes out to the app, causing it to take a snapshot, update, and then restore from the snapshot. My concern is that once the application is no longer being actively developed, changes will not be deployed as regularly, and the checkpoint size could eventually grow too large.
Does anyone know what would cause the checkpoint size to grow continuously without end? I am using State TTL on all significant state and removing state in application code when it is no longer needed. Does the checkpoint size increasing indicate I have a bug in the code that handles state, or is something else potentially at play here?
Update: See https://stackoverflow.com/a/67435073/2000823 for a better answer.
AWS Kinesis Data Analytics (KDA) is currently based on Flink 1.8, where this documentation regarding state cleanup applies.
Note that
by default if expired state is not read, it won’t be removed, possibly leading to ever growing state
You can also activate cleanup during full snapshots (which seems to be occurring), and background cleanup (which sounds like what you want). Note that for some workloads, even if background cleanup is enabled, the default settings for background cleanup might be insufficient to keep up with the rate at which state should be cleaned up, and some tuning might be necessary.
By the way, background cleanup is enabled by default since Flink 1.10.
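For reference, a sketch of how these cleanup strategies are usually wired into a state descriptor; the 24-hour TTL and the descriptor name are assumptions, and cleanupInBackground() requires Flink 1.10+:
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.hours(24))                  // state expires 24h after the last write
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .cleanupFullSnapshot()                       // drop expired entries when taking full snapshots
        .cleanupInBackground()                       // incremental / compaction-filter cleanup (1.10+)
        .build();

ValueStateDescriptor<String> descriptor =
        new ValueStateDescriptor<>("someState", String.class);
descriptor.enableTimeToLive(ttlConfig);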
If this doesn't answer your question, please clarify precisely how state TTL is configured.

Clarification for State Backend

I have been reading the Flink docs and I need a few clarifications. Hopefully someone can help me out here.
State Backend - This basically refers to the location where the data for my operations will be stored; for example, if I'm doing an aggregation over a 2 hr window, where will the buffered data be stored? As pointed out in the docs, for a large state we should use RocksDB.
The RocksDBStateBackend holds in-flight data in a RocksDB database that is (per default) stored in the TaskManager data directories
Does "in-flight data" here refer to the incoming data, say from a Kafka stream, that has not yet been checkpointed?
Upon checkpointing, the whole RocksDB database will be checkpointed into the configured file system and directory. Minimal metadata is stored in the JobManager’s memory
When using RocksDB, when a checkpoint is created, all the buffered data is stored on disk. Then, say when the window is triggered at the end of the 2 hrs, this state that was stored on disk will be deserialised and used for the operation?
Note that the amount of state that you can keep is only limited by the amount of disk space available
Does this mean that I could run an analytical query on a potentially high-throughput stream with very limited resources? Suppose my Kafka stream has a rate of 50k messages/sec; then I could run it on a single core of my EMR cluster, and the tradeoff would be that Flink won't be able to keep up with the incoming rate and will lag, but given enough disk space it won't hit an OOM error?
When a checkpoint is completed, I assume that the completed checkpoint metadata (like the HDFS or S3 path from each TM) from all the TMs will be sent to the JM? In case of a TM failure, the JM will spin up a new JM and restore the state from the last checkpoint.
The default setting for the JM in flink-conf.yaml is jobmanager.heap.size: 1024m.
My confusion here is why the JM needs 1 GB of heap memory. What does a JM handle apart from synchronisation among the TMs? How do I actually decide how much memory should be configured for the JM in production?
Can someone verify whether my understanding is correct and point me in the right direction? Thanks in advance!
Overall your understanding appears to be correct. One point: in the case of a TM failure, the JM will spin up a new TM and restore the state from the last checkpoint (rather than spinning up a new JM).
But to be a bit more precise, in the last few releases of Flink, what used to be a monolithic job manager has been refactored into separate components: a dispatcher that receives jobs from clients and starts new job managers as needed; a job manager that is only concerned with providing services to a single job; and a resource manager that starts up new TMs as needed. The resource manager is the only component that is cluster-framework specific -- e.g., there is a YARN resource manager.
The job manager has other roles as well -- it is the checkpoint coordinator and the API endpoint for the web UI and metrics.
How much heap the JM needs is somewhat variable. The defaults were chosen to try to cover more than a narrow set of situations, and to work out of the box. Also, by default, checkpoints go to the JM heap, so it needs some space for that. If you have a small cluster and are checkpointing to a distributed filesystem, you should be able to get by with less than 1GB.
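To make the last point concrete, a sketch of keeping working state in RocksDB while sending checkpoints to a distributed filesystem rather than the JM heap, assuming Flink 1.13+; the S3 bucket and the 60-second interval are placeholders:
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Keep working state in RocksDB on the TaskManagers' local disks.
env.setStateBackend(new EmbeddedRocksDBStateBackend());

// Write checkpoints to a distributed filesystem instead of the JobManager heap.
env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");

// Take a checkpoint every 60 seconds.
env.enableCheckpointing(60_000);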

Flink Kinesis Consumer not storing last successfully processed sequence nos

We are using Flink Kinesis Consumer to consume data from Kinesis stream into our Flink application.
KCL library uses a DynamoDB table to store last successfully processed Kinesis stream sequence nos. so that the next time application starts, it resumes from where it left off.
But it seems that the Flink Kinesis Consumer does not maintain any such sequence numbers in any persistent store. As a result, we need to rely upon the ShardIteratorType (TRIM_HORIZON, LATEST, etc.) to decide where to resume processing when the Flink application restarts.
A possible solution could be to rely on Flink's checkpointing mechanism, but that only works when the application resumes after a failure, not when it has been deliberately cancelled and needs to be restarted from the last successfully consumed Kinesis stream sequence number.
Do we need to store these last successfully consumed sequence numbers ourselves?
Best practice with Flink is to use checkpoints and savepoints, as these create consistent snapshots that contain offsets into your message queues (in this case, Kinesis stream sequence numbers) together with all of the state throughout the rest of the job graph that resulted from having consumed the data up to those offsets. This makes it possible to recover or restart without any loss or duplication of data.
Flink's checkpoints are snapshots taken automatically by Flink itself for the purpose of recovery from failures, and are in a format optimized for rapid restoration. Savepoints use the same underlying snapshot mechanism, but are triggered manually, and their format is more concerned about operational flexibility than performance.
Savepoints are what you are looking for. In particular, cancel with savepoint and resume from savepoint are very useful.
Another option is to use retained checkpoints with ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION.
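For instance, a sketch of enabling retained checkpoints in recent Flink versions (older versions use enableExternalizedCheckpoints instead; the checkpoint interval is an assumption):
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);

// Keep the latest completed checkpoint around when the job is cancelled,
// so the job can later be restarted from it (e.g. flink run -s <checkpointPath> ...).
env.getCheckpointConfig()
   .setExternalizedCheckpointCleanup(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);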
To add to David's response, I'd like to explain the reasoning behind not storing sequence numbers.
Any kind of offset committing into the source system would limit the checkpointing/savepointing feature to fault tolerance only. That is, you would only be able to recover from the latest checkpoint/savepoint.
However, Flink actually supports jumping back to a previous checkpoint/savepoint. Consider an application upgrade: you take a savepoint beforehand, upgrade, and let the new version run for a couple of minutes, during which it creates a few checkpoints. Then you discover a critical bug. You would like to roll back to the savepoint you took and discard all the checkpoints.
Now, if Flink committed the source offsets only to the source system, we would not be able to replay the data between now and the restored savepoint. So Flink needs to store the offsets in the savepoint itself, as David pointed out. At that point, additionally committing them to the source system does not yield any benefit and is confusing when restoring to a previous savepoint/checkpoint.
Do you see any benefit in storing the offsets additionally?
