State size of more than 500 GB in Flink

I am wondering whether it is okay to use Flink when the state size in the state backend is more than 500 GB.
RocksDB can handle data larger than memory, but I think searching through that much data would be very expensive. I know that RocksDB's caching can reduce the execution time, but I am wondering whether storing such large data in state actually occurs in real Flink use cases.
My scenario looks like this:
I would use MapState for the state. The data stored in the state will be key-value pairs. Each incoming record will be preprocessed using a config value and then appended to the state.
Thanks.

If you can use incremental checkpointing, and if the amount of state that changes per checkpoint interval is significantly less than 500 GB, then yes, that can work.
With the RocksDB state backend, MapState is especially efficient, as changing/adding/deleting an entry doesn't impact any other state in the map.
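As a rough illustration of that point, here is a minimal sketch of keyed MapState in a KeyedProcessFunction (the MyEvent type, its subKey/value fields, and the preprocess step are hypothetical stand-ins, not from the question); with the RocksDB backend each put/get only de/serializes the entry it touches, so the map as a whole never has to fit in memory:

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// MyEvent is a stand-in POJO with String fields subKey and value.
public class AppendToMapState extends KeyedProcessFunction<String, MyEvent, MyEvent> {

    private transient MapState<String, String> entries;

    @Override
    public void open(Configuration parameters) {
        entries = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("entries", Types.STRING, Types.STRING));
    }

    @Override
    public void processElement(MyEvent event, Context ctx, Collector<MyEvent> out) throws Exception {
        // Preprocess with some config value (placeholder), then append to the map;
        // only this one entry is serialized and written to RocksDB.
        entries.put(event.subKey, preprocess(event.value));
        out.collect(event);
    }

    private String preprocess(String value) {
        return value; // stand-in for the config-based preprocessing step
    }
}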

Related

What if the size of the state is larger than the Flink memory size?

I am wondering what happens when the size of the state is larger than Flink's memory size.
Since the state is controlled through the Flink application's APIs by defining, e.g., MapState<K,V> at the code level, it should be possible for the state to store values larger than memory (such as 100 GB or 200 GB).
Is that possible?
You might be interested in reading about State Backends
The HashMapStateBackend holds data internally as objects on the Java heap
HashMapStateBackend will OOM your task managers if your MapStates are too big.
The EmbeddedRocksDBStateBackend holds in-flight data in a RocksDB database that is (per default) stored in the TaskManager local data directories
[...] Note that the amount of state that you can keep is only limited by the amount of disk space available. This allows keeping very large state, compared to the HashMapStateBackend that keeps state in memory. This also means, however, that the maximum throughput that can be achieved will be lower with this state backend. All reads/writes from/to this backend have to go through de-/serialization to retrieve/store the state objects, which is also more expensive than always working with the on-heap representation as the heap-based backends are doing.
EmbeddedRocksDBStateBackend will use the disk, so you have more capacity. Note that it is slower, but caches can help alleviate some of that slowness; I suggest you look at their configuration (in Flink, using RocksDB's mechanism).
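For reference, a minimal sketch of selecting the backend programmatically (Flink 1.13+ class names; the checkpoint interval and S3 path below are placeholders, not anything from the question):

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Disk-backed state, limited only by local disk space; "true" enables
        // incremental checkpoints so each checkpoint uploads only changed SST files.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
        env.enableCheckpointing(60_000); // placeholder interval: 60 s
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints"); // placeholder URI

        // Alternative: new HashMapStateBackend() keeps state as objects on the JVM
        // heap -- faster per access, but bounded by memory.
    }
}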

Iterating over list state with millions of records in Flink

I want to store all the CDC records in list state and stream those records to the respective sinks once a trigger message is received.
The list state can grow up to a million records. Will iterating over the list state in a KeyedProcessFunction cause memory issues? I am planning to use the RocksDB state backend to store the state. What is the correct way of streaming the list state in this case?
Regarding memory usage of ListState, this answer explains how memory is used with the RocksDB state backend: https://stackoverflow.com/a/66622888/19059974
It seems that the whole list needs to fit on the heap, so depending on the size of your elements, it could take a lot of memory.
Ideally, you would want to key the state into smaller partitions, so it can be spread out when increasing the task parallelism. Alternatively, a workaround could be to use a MapState, which does not appear to load all of its contents into memory when iterating over the map. It will use more storage than a ListState, and appending will most likely not be as fast, but it should allow you to iterate over it using less memory.
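Below is a rough sketch of that MapState workaround, under the assumption that the CDC records are strings and that a literal "TRIGGER" marker arrives on the same keyed stream (both assumptions are mine, not from the question); with the RocksDB backend, entries() iterates the map lazily rather than materializing it:

import java.util.Map;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class BufferUntilTrigger extends KeyedProcessFunction<String, String, String> {

    private transient MapState<Long, String> buffer;  // index -> buffered record
    private transient ValueState<Long> nextIndex;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("buffer", Types.LONG, Types.STRING));
        nextIndex = getRuntimeContext().getState(
                new ValueStateDescriptor<>("nextIndex", Types.LONG));
    }

    @Override
    public void processElement(String record, Context ctx, Collector<String> out) throws Exception {
        if ("TRIGGER".equals(record)) {
            // Entries are read one at a time from RocksDB; note that iteration order
            // follows the serialized keys, not necessarily insertion order.
            for (Map.Entry<Long, String> e : buffer.entries()) {
                out.collect(e.getValue());
            }
            buffer.clear();
            nextIndex.clear();
        } else {
            long idx = nextIndex.value() == null ? 0L : nextIndex.value();
            buffer.put(idx, record);
            nextIndex.update(idx + 1);
        }
    }
}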

Flink app's checkpoint size keeps growing

I have a pipeline like this:
env.addSource(kafkaConsumer, name_source)
    .keyBy { value -> value.f0 }
    .window(EventTimeSessionWindows.withGap(Time.seconds(2)))
    .process(MyProcessor())
    .addSink(kafkaProducer)
The keys are guaranteed to be unique in the data that is currently being processed, so I would expect the state size not to grow beyond roughly 2 seconds' worth of data.
However, I've noticed the state size steadily growing over the last day (since the app was deployed).
Is this a bug in Flink?
I'm using Flink 1.11.2 in AWS Kinesis Data Analytics.
Kinesis Data Analytics always uses RocksDB as its state backend. With RocksDB, dead state isn't immediately cleaned up, it's merely marked with a tombstone and is later compacted away. I'm not sure how KDA configures RocksDB compaction, but typically it's done when a level reaches a certain size -- and I suspect your state size is still small enough that compaction hasn't occurred.
With incremental checkpoints (which is what KDA does), checkpointing is done by copying RocksDB's SST files -- which in your case are presumably full of stale data. If you let this run long enough you should eventually see a significant drop in checkpoint size, once compaction has been done.

Flink - how to use state as a cache

I want to read history from state. If the state is null, I read HBase, update the state, and use onTimer to set a state TTL. The problem is how to do batch reads from HBase, because reading a single record at a time from HBase is not efficient.
In general, if you want to cache/mirror state from an external database in Flink, the most performant approach is to stream the database mutations into Flink -- in other words, turn Flink into a replication endpoint for the database's change data capture (CDC) stream, if the database supports that.
I have no experience with hbase, but https://github.com/mravi/hbase-connect-kafka is an example of something that might work (by putting kafka in-between hbase and flink).
If you would rather query hbase from Flink, and want to avoid making point queries for one user at a time, then you could build something like this:
                 -> queryManyUsers -> keyBy(uId) ->
streamToEnrich                                      CoProcessFunction
                 -> keyBy(uId) -------------------->
Here you would split your stream, sending one copy through something like a window or process function or async i/o to query hbase in batches, and send the results into a CoProcessFunction that holds the cache and does the enrichment.
When records arrive in this CoProcessFunction directly, along the bottom path, if the necessary data is in the cache, then it is used. Otherwise the record is buffered, pending the arrival of data for the cache from the upper path.
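A rough sketch of that CoProcessFunction (the UserInfo, Event, and EnrichedEvent types are hypothetical placeholders) might look like this:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

public class CachedEnrichment extends CoProcessFunction<UserInfo, Event, EnrichedEvent> {

    private transient ValueState<UserInfo> cache;    // per-key cache entry (upper path)
    private transient ListState<Event> pending;      // records waiting for the cache (bottom path)

    @Override
    public void open(Configuration parameters) {
        cache = getRuntimeContext().getState(new ValueStateDescriptor<>("cache", UserInfo.class));
        pending = getRuntimeContext().getListState(new ListStateDescriptor<>("pending", Event.class));
    }

    @Override
    public void processElement1(UserInfo info, Context ctx, Collector<EnrichedEvent> out) throws Exception {
        // Upper path: batched query results arrive and fill the cache,
        // releasing anything that was buffered for this key.
        cache.update(info);
        for (Event e : pending.get()) {
            out.collect(new EnrichedEvent(e, info));
        }
        pending.clear();
    }

    @Override
    public void processElement2(Event e, Context ctx, Collector<EnrichedEvent> out) throws Exception {
        // Bottom path: enrich directly if the cache is populated, otherwise buffer.
        UserInfo info = cache.value();
        if (info != null) {
            out.collect(new EnrichedEvent(e, info));
        } else {
            pending.add(e);
        }
    }
}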

Apache Flink: How often is state de/serialized?

How frequently does Flink de/serialise operator state? Per get/update or based on checkpoints? Does the state backend make a difference?
I suspect that in the case of a keyed stream with a large key space (millions of keys) and thousands of events per second per key, the de/serialization might be a big issue. Am I right?
Your assumption is correct. It depends on the state backend.
Backends that store state on the JVM heap (MemoryStateBackend and FSStateBackend) do not serialize state for regular read/write accesses but keep it as objects on the heap. While this leads to very fast accesses, you are obviously bound to the size of the JVM heap and also might face garbage collection issues. When a checkpoint is taken, the objects are serialized and persisted to enable recovery in case of a failure.
In contrast, the RocksDBStateBackend stores all state as byte arrays in embedded RocksDB instances. Therefore, it de/serializes the state of a key for every read/write access. You can control "how much" state is serialized by choosing an appropriate state primitive, i.e., ValueState, ListState, MapState, etc.
For example, ValueState is always de/serialized as a whole, whereas a MapState.get(key) only serializes the key (for the lookup) and deserializes the returned value for the key. Hence, you should use MapState<String, String> instead of ValueState<HashMap<String, String>>. Similar considerations apply for the other state primitives.
The RocksDBStateBackend checkpoints its state by copying its files to a persistent filesystem. Hence, there is no additional serialization involved when a checkpoint is taken.
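To make the MapState-versus-ValueState<HashMap<...>> distinction concrete, here is a small sketch (String keys and values chosen purely as an example) contrasting the two declarations; with RocksDB, the MapState form de/serializes a single entry per access, while reading the ValueState form de/serializes the whole HashMap:

import java.util.HashMap;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class PerEntryAccess extends KeyedProcessFunction<String, String, String> {

    private transient MapState<String, String> preferred;         // per-entry de/serialization
    private transient ValueState<HashMap<String, String>> avoid;  // whole map de/serialized per access

    @Override
    public void open(Configuration parameters) {
        preferred = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("preferred", Types.STRING, Types.STRING));
        avoid = getRuntimeContext().getState(
                new ValueStateDescriptor<>("avoid",
                        TypeInformation.of(new TypeHint<HashMap<String, String>>() {})));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        preferred.put(value, value);        // serializes one key and one value
        out.collect(preferred.get(value));  // deserializes only the value that was looked up
        // By contrast, avoid.value() would deserialize the entire HashMap just to read one entry.
    }
}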

Resources