How frequently does Flink de/serialise operator state? Per get/update or based on checkpoints? Does the state backend make a difference?
I suspect that in the case of a keyed stream with a high-cardinality key (millions of distinct values) and thousands of events per second for each key, the de/serialization might be a big issue. Am I right?
Your assumption is correct. It depends on the state backend.
Backends that store state on the JVM heap (MemoryStateBackend and FSStateBackend) do not serialize state for regular read/write accesses but keep it as objects on the heap. While this leads to very fast accesses, you are obviously bound to the size of the JVM heap and also might face garbage collection issues. When a checkpoint is taken, the objects are serialized and persisted to enable recovery in case of a failure.
In contrast, the RocksDBStateBackend stores all state as byte arrays in embedded RocksDB instances. Therefore, it de/serializes the state of a key for every read/write access. You can control "how much" state is serialized by choosing an appropriate state primitive, i.e., ValueState, ListState, MapState, etc.
For example, ValueState is always de/serialized as a whole, whereas a MapState.get(key) only serializes the key (for the lookup) and deserializes the returned value for the key. Hence, you should use MapState<String, String> instead of ValueState<HashMap<String, String>>. Similar considerations apply for the other state primitives.
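Roughly, the difference looks like this (a hedged sketch, not the canonical way to write it; the class name and the tuple of (attributeName, attributeValue) are just placeholders I made up for illustration):

```java
import java.util.Map;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class AttributeUpdater
        extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, String>> {

    // Anti-pattern with RocksDB: the whole map is deserialized on value()
    // and re-serialized on update(), even if only one entry changes.
    private transient ValueState<Map<String, String>> mapAsValueState;

    // Preferred with RocksDB: each map entry is its own RocksDB key,
    // so put()/get() only de/serializes that single entry.
    private transient MapState<String, String> mapState;

    @Override
    public void open(Configuration parameters) {
        mapAsValueState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("attributes-as-value", Types.MAP(Types.STRING, Types.STRING)));
        mapState = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("attributes", Types.STRING, Types.STRING));
    }

    @Override
    public void processElement(Tuple2<String, String> attr, Context ctx,
                               Collector<Tuple2<String, String>> out) throws Exception {
        // Touches only the single map entry for attr.f0.
        mapState.put(attr.f0, attr.f1);
        out.collect(attr);
    }
}
```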
The RocksDBStateBackend checkpoints its state by copying its files to a persistent filesystem. Hence, there is no additional serialization involved when a checkpoint is taken.
Related
I am wondering whether it is okay to use Flink in a case where the state size in the state backend is more than 500 GB.
RocksDB can handle more data than fits in memory, but I think searching through data of that size is very expensive. I know that RocksDB's caching can reduce the execution time, but I am wondering whether use cases that store such large state in Flink actually exist.
My scenario looks like this:
I would use MapState for the state. The data stored in the state will be key-value pairs, and each incoming record will be preprocessed using a config value and then appended to the state.
Thanks.
If you can use incremental checkpointing, and if the amount of state that changes per checkpoint interval is something significantly less than 500 GB, then yes, that can work.
With the RocksDB state backend, MapState is especially efficient, as changing/adding/deleting an entry doesn't impact any other state in the map.
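For reference, wiring this up could look roughly like the sketch below (the checkpoint interval and the S3 path are placeholder examples; EmbeddedRocksDBStateBackend requires the flink-statebackend-rocksdb dependency):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB keeps state on local disk; "true" enables incremental checkpoints,
        // so only changed SST files are uploaded per checkpoint.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Checkpoint every minute; the durable checkpoint location is an example path.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");

        // ... define sources, keyed state, sinks, then call env.execute(...)
    }
}
```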
I am wondering what happens when the size of the state is larger than Flink's memory.
Since the state is defined through the Flink app's APIs by declaring a MapState<K,V> at the code level, the state could hold a very large amount of data (more than fits in memory, e.g. 100 GB or 200 GB).
Is that possible?
You might be interested in reading about State Backends:
The HashMapStateBackend holds data internally as objects on the Java heap
HashMapStateBackend will OOM your task managers if your MapStates are too big.
The EmbeddedRocksDBStateBackend holds in-flight data in a RocksDB database that is (per default) stored in the TaskManager local data directories
[...] Note that the amount of state that you can keep is only limited by the amount of disk space available. This allows keeping very large state, compared to the HashMapStateBackend that keeps state in memory. This also means, however, that the maximum throughput that can be achieved will be lower with this state backend. All reads/writes from/to this backend have to go through de-/serialization to retrieve/store the state objects, which is also more expensive than always working with the on-heap representation as the heap-based backends are doing.
EmbeddedRocksDBStateBackend will use the disk, so you have more capacity. Note that it is slower, but caches can alleviate some of that slowness; I suggest you look at their configuration (Flink exposes RocksDB's mechanisms for this).
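A rough sketch of the choice between the two backends from code (assuming the flink-statebackend-rocksdb dependency; setPredefinedOptions is just one knob, and most cache/memory tuning is usually done through configuration options rather than in code):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendChoice {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Heap-based: fastest access, but all state must fit in the TaskManager heap.
        // env.setStateBackend(new HashMapStateBackend());

        // Disk-backed: capacity limited by local disk; per-access de/serialization cost.
        EmbeddedRocksDBStateBackend rocksdb = new EmbeddedRocksDBStateBackend();
        // One way to influence RocksDB's caching/compaction behaviour from code.
        rocksdb.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
        env.setStateBackend(rocksdb);
    }
}
```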
I have fairly large broadcast state (about 62 MB when serialized as state). I noticed that each instance of my operator is saving a copy of this state during checkpointing. With a parallelism of 400, that's roughly 24 GB of checkpoint state, most of it duplicated.
This matches the description of Important Considerations in the docs. On the other hand, Checkpointing under backpressure says:
Broadcast partitioning is often used to implement a broadcast state which should be equal across all operators. Flink implements the broadcast state by checkpointing only a single copy of the state from subtask 0 of the stateful operator. Upon restore, we send that copy to all of the operators. Therefore it might happen that an operator will get the state with changes applied for a record that it will soon consume from its checkpointed channels.
The bit about "checkpointing only a single copy of the state from subtask 0" doesn't match what I'm seeing; hoping someone can clarify.
Regardless... is there any typical workaround for this? For example, I could set up my TMs with one slot (even though they have 8 cores), and then use a thread pool to process incoming non-broadcast elements. This would reduce the operator's parallelism by 8x. Assuming I deal with concurrency issues (threads accessing state while it's being updated), what other issues are there? E.g. can the collector be saved and then safely called asynchronously by a thread? I don't have watermarks, but I'm wondering about things like checkpoint barriers.
Or I could bail on using a broadcast stream, and replicate the data myself (with carefully constructed keys), but that's also a helicopter stunt.
The bit about "checkpointing only a single copy of the state from subtask 0" is incorrect (I verified this with the author of that sentence). In the current implementation of BroadcastState all operators snapshot their state.
I'm afraid that doesn't help answer your real question, but hopefully clarifies the situation.
I want to store all the CDC records in list state and stream those records to the respective sinks once a trigger message is received.
The list state can grow up to a million records. Will iterating over the list state in a KeyedProcessFunction cause memory issues? I am planning to use the RocksDB state backend to store the state. What is the correct way of streaming the list state in this case?
Regarding memory usage of ListState this answer explains how memory is used with the RocksDB state backend: https://stackoverflow.com/a/66622888/19059974
It seems that the whole list will need to fit into the heap, so depending on the size of your elements, it could take a lot of memory.
Ideally, you would want to key the state into smaller partitions, so it can be spread out when increasing the task parallelism. Alternatively, a workaround could be to use a MapState, which does not seem to load all of its contents into memory when iterating over the map. It will use more storage than a ListState and appending most likely won't be as fast, but it should allow you to iterate over the state using less memory.
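Something like the following sketch could work (purely illustrative and untested; every element is treated as a String here, and the literal "TRIGGER" stands in for whatever your real trigger message looks like):

```java
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Buffers records per key in MapState (indexed by an insertion counter) and
// flushes them when a trigger arrives.
public class BufferUntilTrigger extends KeyedProcessFunction<String, String, String> {

    private transient MapState<Long, String> buffer;  // one state entry per buffered record
    private transient ValueState<Long> nextIndex;     // per-key insertion counter

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("buffer", Types.LONG, Types.STRING));
        nextIndex = getRuntimeContext().getState(
                new ValueStateDescriptor<>("next-index", Types.LONG));
    }

    @Override
    public void processElement(String record, Context ctx, Collector<String> out) throws Exception {
        if ("TRIGGER".equals(record)) {
            // With RocksDB, MapState entries are iterated one at a time rather
            // than being materialized as one big object like ListState#get().
            for (String buffered : buffer.values()) {
                out.collect(buffered);
            }
            buffer.clear();
            nextIndex.clear();
        } else {
            long idx = nextIndex.value() == null ? 0L : nextIndex.value();
            buffer.put(idx, record);
            nextIndex.update(idx + 1);
        }
    }
}
```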
When thinking about the act of keying by something I traditionally think of the analogy of throwing all the events that match the key into the same bucket. As you can imagine, when the Flink application starts handling lots of data what you opt to key by starts to become important because you want to make sure you clean up state well. This leads me to my question, how exactly does Flink clean up these "buckets"? If the bucket is empty (all the MapStates and ValueStates are empty) does Flink close that area of the key space and delete the bucket?
Example:
Incoming Data Format: {userId, computerId, amountOfTimeLoggedOn}
Key: UserId/ComputerId
Current Key Space:
Alice, Computer 10: Has 2 events in it. Both events are stored in state.
Bob, Computer 11: Has no events in it. Nothing is stored in state.
Will Flink come and remove Bob, Computer 11 from the Key Space eventually or does it just live on forever because at one point it had an event in it?
Flink does not store any data for state keys which do not have any user value associated with them, at least in the existing state backends: Heap (in memory) or RocksDB.
The key space is virtual in Flink; Flink does not make any assumptions about which concrete keys can potentially exist. There are no pre-allocated buckets per key or subset of keys. Only once the user application writes some value for some key does that key occupy storage.
The general idea is that all records with the same key are processed on the same machine (somewhat like being in the same bucket as you say). The local state for a certain key is also always kept on the same machine (if stored at all). This is not related to checkpoints though.
For your example, if some value was written for [Bob, Computer 11] at some point in time and then subsequently removed, Flink will remove it completely, along with the key.
Short Answer
It cleans up with the help of the Time-To-Live (TTL) feature of Flink state and the Java Garbage Collector (GC). The TTL feature removes any reference to the state entry, and the GC takes back the allocated memory.
Long Answer
Your question can be divided into 3 sub-questions:
I will try to be as brief as possible.
How does Flink partition the data based on Key?
For an operator over a keyed stream, Flink partitions the data on a key with the help of Consistent Hashing Algorithm. It creates max_parallelism number of buckets. Each operator instance is assigned one or more of these buckets. Whenever a datum is to be sent downstream, the key is assigned to one of those buckets and consequently sent to the concerned operator instance. No key is stored here because ranges are calculated mathematically. Hence no area is cleared or bucket is deleted anytime. You can create any type of key you want. It won't affect the memory in terms of keyspace or ranges.
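As a rough illustration of that bucketing (this is a simplification I wrote for clarity, not Flink's actual code; the real logic lives in org.apache.flink.runtime.state.KeyGroupRangeAssignment and applies a murmur hash to key.hashCode() before the modulo):

```java
// Simplified sketch of how Flink maps a key to an operator instance (key group).
public final class KeyGroupSketch {

    static int assignToKeyGroup(Object key, int maxParallelism) {
        // Placeholder hash; Flink uses MathUtils.murmurHash(key.hashCode()).
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    static int operatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        // Each operator instance owns a contiguous range of key groups.
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;  // number of "buckets" (key groups)
        int parallelism = 4;       // running operator instances
        int keyGroup = assignToKeyGroup("Alice/Computer10", maxParallelism);
        System.out.println("key group " + keyGroup + " -> subtask "
                + operatorIndexForKeyGroup(maxParallelism, parallelism, keyGroup));
    }
}
```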
How does Flink store state with a Key?
All operator instances have an instance-level state store. This store defines the state context of that operator instance, and it can hold multiple named state storages, e.g. "count", "sum", "some-name", etc. These named state storages are key-value stores that can store values based on the key of the data.
These KV stores are created when we initialize the state with a state descriptor in the open() function of an operator, e.g. getRuntimeContext().getState(valueStateDescriptor).
These KV stores will store data only when something is needed to be stored in the state. (like HashMap.put(k,v)). Thus no key or value is stored unless state update methods (like update, add, put) are called.
So,
If Flink hasn't seen a key, nothing is stored for that key.
If Flink has seen the key but didn't call the state update methods, nothing is stored for that key.
If a state update method is called for a key, the key-value pair will be stored in the KV store.
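A minimal sketch of the above (a made-up counting function; the point is that value() returns null for keys with no entry and that update() is what actually creates the KV entry):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Counts events per key. The "count" entry for a key is created on the first
// update() for that key and removed again by clear().
public class CountPerKey extends KeyedProcessFunction<String, String, Long> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        // Registers the named state storage; nothing is stored for any key yet.
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<Long> out) throws Exception {
        Long current = count.value();          // null if this key has no entry yet
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);                 // the first update() creates the KV entry
        out.collect(updated);

        // count.clear() would remove the entry for the current key entirely,
        // as in the [Bob, Computer 11] example above.
    }
}
```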
How does Flink clean up the state for a Key?
Flink does not delete state unless it is required to by the user (via TTL) or the user removes it manually. As mentioned earlier, Flink has the TTL feature for state. TTL marks the state entry as expired, and the entry is removed when a cleanup strategy is invoked. These cleanup strategies vary with the backend type and the time of cleanup. For the heap state backend, it removes the entry from the state table, i.e., it removes any reference to the entry; the memory occupied by this unreferenced entry is then cleaned up by the Java GC. For the RocksDB state backend, it simply calls the native delete method of RocksDB.
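TTL is opt-in per state descriptor. A hedged sketch of how it is typically enabled (the 24-hour TTL, the update/visibility choices, and the compaction-filter parameter are arbitrary example values, not recommendations):

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

public class TtlExample {
    public static ValueStateDescriptor<Long> countDescriptorWithTtl() {
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.hours(24))  // entry expires 24h after the last write
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                // For RocksDB, expired entries are dropped during compaction.
                .cleanupInRocksdbCompactFilter(1000)
                .build();

        ValueStateDescriptor<Long> descriptor =
                new ValueStateDescriptor<>("count", Long.class);
        descriptor.enableTimeToLive(ttlConfig);
        return descriptor;
    }
}
```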