Are there limitations in using a State in Apache Flink? - apache-flink

Apache Flink allows me to use a State in a RichMapFunction. I am planning to build a continuously running job which analyses a stream of web events. Part of the processing will be the creation of a session context with session scoped metrics (like nth of the session, duration etc) and additionally a user context.
A session context will timeout after 30 minutes, but a user context may exist for a year to handle returning users.
There will be millions of sessions and users so I would end up in millions of individual states. Every state is just a few KB in size.
Is this something that can be handled properly with the Flink states?
How is Flink actually cleaning up deprecated states?
Would it make sense to think about providing a custom backend to store the state in a KV cluster?

For large state I would recommend using Flink's RocksDBStateBackend. This state backend uses RocksDB to store state. Since RocksDB gracefully spills to disk, it is only limited by your available disk space. Thus, Flink should be able to handle your use case.
At the moment you need to register timers to clean up state. However, with the next Flink release, the community will add clean up for state with TTL. This will then automatically clean up your state when it is expired.
Keeping your state close to your computation with periodic checkpoints which are persisted will keep your application fast. If every state access went to a remote KV cluster, it would considerably slow down the processing.

Related

Access to Subtask Metrics in a KeyedStream

I want to load a large amount of data into Flinks state backend (RocksDB) and process events using this data (for example in a CoProcessFunction). The data is much larger than the memory of a TaskManager.
The data blocks for a key can be very large, which has a negative impact on latency if the data needs to be loaded from the state backend each time. Therefore I would like to keep the data for frequent keys locally in the CoProcessFunction.
However, the total data in the state backend is larger than the memory of a TaskManager so it is not possible to keep the corresponding data block from the state backend locally for each key.
To solve this problem I would need to know the current memory usage of a SubTask to decide if a data block for a key can be kept locally or if something needs to be deleted. So here is my question: Since keys are not clearly assigned to a subtask is there a way to access memory related subtask information or custom metrics related to subtasks in a KeyedStream ? Or is there another way to solve this problem ? (External access via Async I/O is not an option).
The RocksDB block cache is already doing roughly what you describe. Rather than implementing your own caching layer, you should be able to get good results by tuning RocksDB, including giving it plenty of memory to work with.
Using RocksDB State Backend in Apache Flink: When and How is a good starting point, and includes pointers to where you can learn more about the native RocksDB metrics, memory management, etc. I also recommend reading The Impact of Disks on RocksDB State Backend in Flink: A Case Study.
As for your question about accessing subtask metrics from within your Flink job -- I don't know of any way to do that locally. You could, I suppose, implement a Flink source connector that fetches them and streams them into the job as another data source.

Apache Flink Checkpoining (Manually put a value into RocksDB Checkpoint and retrieve during recovery or Restart)

We have a scenario where we have to persist/save some value into the checkpoint and retrieve it back during failure recovery/application restart.
We followed a few things like ValueState, ValueStateDescriptor still not working.
https://github.com/realtime-storage-engine/flink-spillable-statebackend/blob/master/flink-spillable-benchmark/src/main/java/org/apache/flink/spillable/benchmark/WordCount.java
https://towardsdatascience.com/heres-how-flink-stores-your-state-7b37fbb60e1a
https://github.com/king/flink-state-cache/blob/master/examples/src/main/java/com/king/flink/state/Example.java
We can't externalize it to a DB as it may cause some performance issues.
Any lead to this will be helpful using checkpoint. How to put and get back from a Checkpoint?
All of your managed application state is automatically written into Flink checkpoints (and savepoints). This includes
keyed state (ValueState, ListState, MapState, etc)
operator state (ListState, BroadcastState, etc)
timers
This state is automatically restored during recovery, and can optionally be restored during manual restarts.
The Flink Operations Playground shows how to work with checkpoints and savepoints, and lets you observe their behavior during failure/recovery and restarts/rescaling.
If you want to read from a checkpoint yourself, that's what the State Processor API is for. Here's an example.

Initialize Flink Job

We are deploying a new Flink stream processing job and it's state (stores) need to be initialized with historical data and this data should be available in the state store before it starts processing any new application events. We don't want to significantly modify the Flink job to also load the historical data.
We considered writing another, separate Flink job to process the historical data, update it's state store and create a Savepoint and use this Savepoint to initialize the state in the main Flink job. Looks like State Processor API only works with DataSet API and wondering about any alternative solutions. Thanks.
The State Processor API is a good solution. It provides a sort of savepoint connector that you use in a DataSet job to read/modify/update the savepoints that you use in your DataStream jobs.
It's a pretty simple change (definitely not "significant") to support a -preload mode for your job, where the non-historical data sources get replaced by empty/non-terminating sources. I typically use counters to decide when state has been fully populated, then stop with a savepoint, and restart without the -preload option.

Checkpoint Shared is Very Large

I running my flink app with 16 parallelism. after 20 minutes shared checkpoint increase to 235MB. how i can i handle it. it's very large in long time.
Every task manager is a Openshift Pod
Task managers: 4
Tasks per task manager: 4
CPU per task manager: 4 Core
Memory per task manager: 6GB
Used Rocksdb state bakend
Enabled incremental checkpoint
Below image is for a task manager(Pod)
Flink will use only as much space for state as is required to do what you've asked it to do. If you are unhappy with the result, you need to somehow ask it to do less.
Here some things you might do:
Make sure your application isn't leaking state. This can happen, for example, if you are using keyed state with an unbounded key space, and aren't clearing the state.
Establish a state retention interval (for the Table/SQL API).
Use State TTL to free unneeded state.
There are certain anti-patterns that require a lot of buffering in state. You should avoid those. :)
You could restrict the resources available for storing state, but this will result in the job failing when those resources are exhausted.
Also, 235MB across 16 slots isn't very large for RocksDB. With incremental checkpointing, RocksDB is storing multiple (uncompacted) copies of the state. The actual active state you're using could be much less.

Flink keyed state clean up for incremental rocksdb checkpointing

We have a flink job that would persist large keyed state in rocksdb backend. We are using incremental checkpointing strategy. As time goes by, the size of the state become a problem. We have checked the state ttl solution but it does not support incremental rocksdb states.
What would be the best approach for this problem if I really need incremental checkpoint?
One approach that is often used is to manipulate the state in some kind of ProcessFunction, and use a timer to clear the state when it is no longer needed -- e.g., if it hasn't been accessed for several hours. ProcessFunctions are able to have both event-time and processing-time timers, so you can choose whichever is more appropriate for your use case.
See the expiring state exercise on the Flink training site for an example of using timers to clear state.

Resources