I want to load a large amount of data into Flink's state backend (RocksDB) and process events using this data (for example, in a CoProcessFunction). The data is much larger than the memory of a TaskManager.
The data blocks for a key can be very large, which hurts latency if the data has to be loaded from the state backend every time. Therefore, I would like to keep the data for frequent keys locally in the CoProcessFunction.
However, the total data in the state backend is larger than the memory of a TaskManager, so it is not possible to keep the corresponding data block from the state backend locally for every key.
To solve this problem, I would need to know the current memory usage of a subtask in order to decide whether a data block for a key can be kept locally or whether something needs to be evicted. So here is my question: since keys are not explicitly assigned to a subtask, is there a way to access memory-related subtask information, or custom metrics related to subtasks, in a KeyedStream? Or is there another way to solve this problem? (External access via Async I/O is not an option.)
The RocksDB block cache is already doing roughly what you describe. Rather than implementing your own caching layer, you should be able to get good results by tuning RocksDB, including giving it plenty of memory to work with.
Using RocksDB State Backend in Apache Flink: When and How is a good starting point, and includes pointers to where you can learn more about the native RocksDB metrics, memory management, etc. I also recommend reading The Impact of Disks on RocksDB State Backend in Flink: A Case Study.
As for your question about accessing subtask metrics from within your Flink job -- I don't know of any way to do that locally. You could, I suppose, implement a Flink source connector that fetches them and streams them into the job as another data source.
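As a minimal sketch of the tuning direction, here is how you might enable the RocksDB state backend with incremental checkpoints and a memory-heavy tuning profile. The SPINNING_DISK_OPTIMIZED_HIGH_MEM profile and the managed-memory fraction mentioned in the comment are illustrative assumptions, not recommendations for your workload:

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.contrib.streaming.state.PredefinedOptions;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RocksDbTuningSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // RocksDB state backend with incremental checkpoints enabled
            EmbeddedRocksDBStateBackend backend = new EmbeddedRocksDBStateBackend(true);

            // Predefined tuning profile that favors a large block cache (illustrative choice)
            backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);

            env.setStateBackend(backend);

            // The block cache size is ultimately governed by Flink's managed memory, e.g.
            // taskmanager.memory.managed.fraction in flink-conf.yaml (0.6 is just an example):
            // taskmanager.memory.managed.fraction: 0.6
        }
    }

How much memory to dedicate to RocksDB is something you will have to determine experimentally, using the native RocksDB metrics mentioned in the blog posts above.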
Related
I am in the process of creating an ETL and fraud management module using Flink to analyze a sequence of real-time credit card transactions.
All transactions are received by an exposed API that pushes the data into a Kafka topic.
First, the received data needs to be checked and cleaned, and then stored in a database.
The next step is a fraud analysis of these transactions.
In this first step, with Flink, I have to check in the card database that the card is known before continuing. The problem is that there are around a billion cards in this database, and new cards could be added over time.
So I'm not sure whether I could cache all the card numbers in memory, or how to handle this check efficiently: is Flink able to maintain some kind of sliding cache to check cards for existence in batches?
What you might do is to mirror the card database into Flink's key-partitioned state, either on-heap, or using RocksDB if you want this to spill to disk. Key-partitioned state is sharded across the cluster, so if you do want to keep the entire card database in memory, you can scale up the cluster until that's feasible.
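As a rough sketch of that pattern, you could connect a stream of card records with the transaction stream, both keyed by card number, and keep a per-card flag in keyed state. CardRecord and Transaction are made-up placeholder types, not part of the original question:

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
    import org.apache.flink.util.Collector;

    // CardRecord and Transaction are assumed POJOs; both streams are keyed by card number.
    public class CardExistenceCheck
            extends KeyedCoProcessFunction<String, CardRecord, Transaction, Transaction> {

        private transient ValueState<Boolean> knownCard;

        @Override
        public void open(Configuration parameters) {
            knownCard = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("knownCard", Boolean.class));
        }

        // Input 1: the mirrored card database (initial load plus new cards over time)
        @Override
        public void processElement1(CardRecord card, Context ctx, Collector<Transaction> out)
                throws Exception {
            knownCard.update(true);
        }

        // Input 2: the transaction stream; forward only transactions for known cards
        @Override
        public void processElement2(Transaction tx, Context ctx, Collector<Transaction> out)
                throws Exception {
            if (Boolean.TRUE.equals(knownCard.value())) {
                out.collect(tx);
            }
            // Unknown cards could be routed to a side output for further handling.
        }
    }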
To keep only recently seen values, you could rely on state TTL to expire records that haven't been accessed recently.
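For example, a TTL could be attached to the per-card state by replacing the open() method of the sketch above (this assumes additional imports for org.apache.flink.api.common.state.StateTtlConfig and org.apache.flink.api.common.time.Time; the 90-day retention is an arbitrary assumption):

    @Override
    public void open(Configuration parameters) {
        // Expire card entries that have not been read or written for 90 days (value is an assumption).
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.days(90))
                .setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        ValueStateDescriptor<Boolean> descriptor =
                new ValueStateDescriptor<>("knownCard", Boolean.class);
        descriptor.enableTimeToLive(ttlConfig);

        knownCard = getRuntimeContext().getState(descriptor);
    }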
An alternative: Flink SQL has support for doing streaming lookup joins against JDBC databases, and you can configure caching for that.
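A hedged sketch of that alternative, using the JDBC connector with a lookup cache. The table and column names, the JDBC URL, and the cache sizes are made up, and the exact cache option names differ between Flink versions, so check the JDBC connector documentation for your release:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class CardLookupJoinSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // Card table backed by the JDBC connector; the lookup cache keeps recently
            // used cards in memory instead of querying the database for every transaction.
            tEnv.executeSql(
                    "CREATE TABLE cards (" +
                    "  card_number STRING, " +
                    "  PRIMARY KEY (card_number) NOT ENFORCED" +
                    ") WITH (" +
                    "  'connector' = 'jdbc', " +
                    "  'url' = 'jdbc:postgresql://db-host:5432/carddb', " +  // placeholder URL
                    "  'table-name' = 'cards', " +
                    "  'lookup.cache.max-rows' = '1000000', " +
                    "  'lookup.cache.ttl' = '10min'" +
                    ")");

            // The Kafka-backed transactions table (with a processing-time attribute
            // proc_time) is omitted here; the lookup join probes the card table per transaction.
            tEnv.executeSql(
                    "SELECT t.* " +
                    "FROM transactions AS t " +
                    "JOIN cards FOR SYSTEM_TIME AS OF t.proc_time AS c " +
                    "ON t.card_number = c.card_number");
        }
    }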
I am new to Flink. How do I determine the production cluster requirements for Flink, and how do I decide on the job memory, task memory, and task slots for each job execution in YARN cluster mode?
For example, I have to process around 600-700 million records each day using the DataStream API, since it is real-time data.
There's no one-size-fits-all answer to these questions; it depends. It depends on the sort of processing you are doing with these events, whether or not you need to access external resources/services in order to process them, how much state you need to keep and the access and update patterns for that state, how frequently you will checkpoint, which state backend you choose, etc, etc. You'll need to do some experiments, and measure.
See How To Size Your Apache Flink® Cluster: A Back-of-the-Envelope Calculation for an in-depth introduction to this topic. https://www.youtube.com/watch?v=8l8dCKMMWkw is also helpful.
I am running my Flink app with a parallelism of 16. After 20 minutes, the shared checkpoint size had grown to 235 MB. How can I handle this? It becomes very large over time.
Every task manager is an OpenShift pod.
Task managers: 4
Tasks per task manager: 4
CPU per task manager: 4 Core
Memory per task manager: 6GB
State backend: RocksDB
Incremental checkpointing: enabled
(The attached image shows the metrics for one task manager (pod).)
Flink will use only as much space for state as is required to do what you've asked it to do. If you are unhappy with the result, you need to somehow ask it to do less.
Here are some things you might do:
Make sure your application isn't leaking state. This can happen, for example, if you are using keyed state with an unbounded key space, and aren't clearing the state.
Establish a state retention interval (for the Table/SQL API); see the sketch after this list.
Use State TTL to free unneeded state.
There are certain anti-patterns that require a lot of buffering in state. You should avoid those. :)
You could restrict the resources available for storing state, but this will result in the job failing when those resources are exhausted.
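For the Table/SQL API point above, a minimal sketch of setting the idle state retention (the one-hour value is an arbitrary assumption):

    import java.time.Duration;

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class IdleStateRetentionSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // State that has been idle for longer than this becomes eligible for cleanup
            // (the one-hour value is an assumption, not a recommendation).
            tEnv.getConfig().setIdleStateRetention(Duration.ofHours(1));
        }
    }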
Also, 235 MB across 16 slots isn't very large for RocksDB. With incremental checkpointing, RocksDB is storing multiple (uncompacted) copies of the state. The actual active state you're using could be much less.
Apache Flink allows me to use state in a RichMapFunction. I am planning to build a continuously running job which analyses a stream of web events. Part of the processing will be the creation of a session context with session-scoped metrics (like the nth event of the session, duration, etc.) and additionally a user context.
A session context will timeout after 30 minutes, but a user context may exist for a year to handle returning users.
There will be millions of sessions and users, so I would end up with millions of individual state entries. Each state entry is just a few KB in size.
Is this something that can be handled properly with the Flink states?
How does Flink actually clean up stale state?
Would it make sense to think about providing a custom backend to store the state in a KV cluster?
For large state I would recommend using Flink's RocksDBStateBackend. This state backend uses RocksDB to store state. Since RocksDB gracefully spills to disk, it is only limited by your available disk space. Thus, Flink should be able to handle your use case.
At the moment you need to register timers to clean up state. However, with the next Flink release, the community will add TTL support for state, which will then automatically clean up your state once it has expired.
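A minimal sketch of the timer-based approach, using a KeyedProcessFunction (timers are not available in a plain RichMapFunction). SessionEvent and SessionContext are placeholder types, the stream is assumed to be keyed by session id and to carry event-time timestamps, and the 30-minute timeout matches the question:

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // SessionEvent and SessionContext are assumed placeholder types.
    public class SessionCleanup
            extends KeyedProcessFunction<String, SessionEvent, SessionContext> {

        private static final long SESSION_TIMEOUT_MS = 30 * 60 * 1000; // 30 minutes

        private transient ValueState<SessionContext> session;
        private transient ValueState<Long> cleanupAt;

        @Override
        public void open(Configuration parameters) {
            session = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("session", SessionContext.class));
            cleanupAt = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("cleanupAt", Long.class));
        }

        @Override
        public void processElement(SessionEvent event, Context ctx, Collector<SessionContext> out)
                throws Exception {
            // ... update the session context in state here ...

            // Push the cleanup timer 30 minutes past this event's timestamp,
            // removing the previously registered timer.
            long newCleanup = ctx.timestamp() + SESSION_TIMEOUT_MS;
            Long previous = cleanupAt.value();
            if (previous != null) {
                ctx.timerService().deleteEventTimeTimer(previous);
            }
            ctx.timerService().registerEventTimeTimer(newCleanup);
            cleanupAt.update(newCleanup);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<SessionContext> out)
                throws Exception {
            // Emit the finished session and free its state.
            SessionContext finished = session.value();
            if (finished != null) {
                out.collect(finished);
            }
            session.clear();
            cleanupAt.clear();
        }
    }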
Keeping your state close to your computation with periodic checkpoints which are persisted will keep your application fast. If every state access went to a remote KV cluster, it would considerably slow down the processing.
Flink has a MemoryStateBackend and an FsStateBackend (and a RocksDBStateBackend). Both seem to build on the HeapKeyedStateBackend, i.e., the mechanism for storing the current working state is entirely the same.
This SO answer says that the main difference lies in the MemoryStateBackend keeping a copy of the checkpoints in the JobManager's memory. (I wasn't able to glean any evidence for that from the source code.)
The MemoryStateBackend also limits the maximum state size per subtask.
Now I wonder: Why would you ever want to use the MemoryStateBackend?
As you said, both the MemoryStateBackend and the FsStateBackend are based on the HeapKeyedStateBackend. This means that both state backends maintain the state of an operator as regular objects on the JVM heap of the TaskManager, i.e., state is always accessed in memory.
The backends differ in how they persist the state for checkpoints. A checkpoint is a copy of the state of all operators of an application that is stored somewhere. In case of a failure, the application is restarted and the state of the operators is initialized from the checkpoint.
The FsStateBackend stores the checkpoint in a file system, typically HDFS, S3, or an NFS that is mounted on all worker nodes. The MemoryStateBackend stores the checkpoint in the JVM memory of the JobManager. This has the following pros and cons:
Pros:
No need to setup a (distributed) file system.
No need to configure a storage location.
Cons:
State is lost if the JobManager process dies.
Size of state is bound by the size of the JobManager memory.
Since checkpoints are lost if the JobManager goes down, the MemoryStateBackend is unsuitable for most production use cases. It can be useful for developing and testing stateful applications, because it requires no configuration or setup.
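For reference, this is roughly how the choice looked in the DataStream API at the time this answer was written (these classes have since been deprecated in favor of newer configuration options; the HDFS path is a placeholder):

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.runtime.state.memory.MemoryStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StateBackendChoiceSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoints go to a durable file system; working state stays on the TaskManager heap.
            env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

            // Or: checkpoints are kept in the JobManager's JVM memory -- fine for local
            // development and tests, but lost if the JobManager dies.
            // env.setStateBackend(new MemoryStateBackend());
        }
    }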