I am measuring the memory usage of an application (WordCount) in Flink with ps -p TaskManagerPID -o rss. However, the results don't make any sense: for every amount of data (1MB, 10MB, 100MB, 1GB, 10GB) the same amount of memory is used. For 10GB of data the measured value is even less than 10GB. Is TaskManager the wrong process for measuring memory usage? Which process of the Flink Process Model is responsible for memory allocation?
Flink features two processing modes, stream and batch processing.
Stream Processing
In stream processing, Flink uses pluggable state backends to maintain the state of an application. In Flink version 1.5.0, there are two types of state backends: 1) backends (FsStateBackend and MemoryStateBackend) that store the application state on the heap of the worker (TaskManager) JVM process, and 2) the RocksDBStateBackend, which stores the state in RocksDB on disk. In both cases, you can monitor the memory consumption using regular JVM memory monitoring tools. However, for the RocksDBStateBackend most of the state will be stored on disk.
Batch Processing
The internal processing algorithms (sorting, hash tables) of the batch processing operators work with managed memory that is (typically) pre-allocated when a worker process (TaskManager) starts and is never returned. Flink assigns this managed memory to its algorithms, and the algorithms spill to disk if the amount of data exceeds their memory budget. Since all memory is pre-allocated and internally managed by Flink, it is not possible to measure the actual memory consumption with an external tool such as ps.
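To illustrate why the measured RSS looks the same for every input size, here is a toy model of an operator with a fixed memory budget. This is only a sketch of the reasoning above, not Flink code: the function name, the 512 MB budget, and the returned fields are all hypothetical.

```python
def sort_with_budget(input_mb: int, budget_mb: int) -> dict:
    """Toy model: an operator keeps at most budget_mb in memory and spills the rest.

    Assumption: the budget is allocated up front and never returned, so the
    memory footprint seen from outside does not depend on input_mb.
    """
    used_of_budget_mb = min(input_mb, budget_mb)
    spilled_to_disk_mb = max(0, input_mb - budget_mb)
    return {
        "resident_mb": budget_mb,            # what an external tool would roughly see
        "used_of_budget_mb": used_of_budget_mb,
        "spilled_to_disk_mb": spilled_to_disk_mb,
    }


if __name__ == "__main__":
    # Input sizes from the question: 1MB, 10MB, 100MB, 1GB, 10GB.
    for size_mb in (1, 10, 100, 1024, 10240):
        print(size_mb, sort_with_budget(size_mb, budget_mb=512))
```

In this model the resident figure is identical for all five inputs, and only the spilled amount grows, which matches the flat ps readings described in the question.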
Related
I am using Flink state to process my data.
I chose RocksDB because the data to be stored is relatively large compared to the available memory.
I set up the RocksDB configuration and ran my Flink app for several hours.
I expected the job to run without any errors, but I found that the job manager stopped receiving heartbeats from the task manager.
When I monitored the memory metrics, the off-heap memory of the task manager kept growing until the job died.
I know that RocksDB is the best choice for storing large state in stream processing, but in my case it did not work out.
For these reasons, I would like to know exactly when Flink flushes data to disk in RocksDB. I would also appreciate any guidance on configuring RocksDB.
I configured the following settings related to RocksDB; everything else uses the default values.
state.backend: rocksdb
state.backend.incremental: true
state.backend.rocksdb.memory.fixed-per-slot: 924m
state.backend.rocksdb.memory.managed: false
state.checkpoints.dir: s3://nxflow-bucket/checkpoints
state.checkpoints.num-retained: 1
state.savepoints.dir: s3://nxflow-bucket/savepoints
You can see that the off-heap memory is growing, as shown in the figure below.
Is there a setting that controls when RocksDB flushes data to disk?
Also, is my job really using the disk for the state backend? If so, why did my job die?
Thanks.
One obvious advantage of using the EmbeddedRocksDBStateBackend with Flink is that it can spill to disk when there isn't enough memory. But if I'm prepared to give it enough memory so that it never needs to use the disk, how is that different from using the HashMapStateBackend?
These are the major differences:
The serialized format in which the RocksDB state backend maintains state has (much) less overhead (in general) than the binary object format used on the heap. So for a given amount of memory, RocksDB can hold more state.
The ser/de overhead in RocksDB means that the RocksDB backend has significantly less throughput (on average).
The RocksDB backend maintains its state in off-heap memory, whereas state kept on the heap is subject to GC overhead and pauses. So RocksDB may have better worst-case latency. (Once Flink supports Java 17 and its modern garbage collectors, this factor may disappear.)
The RocksDB backend supports incremental checkpointing, which can significantly speed up both snapshots and restores (but see FLIP-151).
FWIW, some users choose to deploy with RocksDB configured to use a RAM disk as the local disk.
I went through a lot of blogs and Stack Overflow answers, but I am still not clear about Flink memory management. In a few blogs I found "Memory Manager Pool" and "RocksDB". I am using RocksDB and I assume all my state is stored in that DB.
Here are my doubts:
How is memory management handled in streaming?
What is the difference between memory management in streaming and in batch?
What is the difference between the "Memory Manager Pool" and the state backend (RocksDB)?
In streaming, what is meant by "Flink Managed Memory"? Does it include the memory required by the RocksDB cache and buffers?
Streaming
When you use the RocksDBStateBackend, all KeyedState (ValueState, MapState, ... and Timers) is stored in RocksDB. OperatorState is kept on the heap. OperatorState is usually very small and is seldom used directly by a Flink developer.
For Flink 1.10+, managed memory includes all memory used by RocksDB. Flink makes sure that RocksDB's memory usage stays within the limits of the assigned managed memory. Use taskmanager.memory.managed.fraction to tune how much memory you give to RocksDB. Usually, you can give all memory but 500MB to RocksDB.
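As a rough sketch of the "all memory but 500MB" rule of thumb, the fraction can be derived from the memory size. The helper name below is hypothetical, and it assumes the fraction is applied to the total Flink memory of the TaskManager; the remaining 500MB then has to cover heap, network buffers, and everything else, so treat the result as a starting point rather than a recommendation.

```python
import math


def managed_fraction(total_flink_memory_mb: float, reserved_mb: float = 500.0) -> float:
    """Fraction of total memory to hand to managed memory, leaving reserved_mb for the rest.

    Assumption: the fraction is relative to total Flink memory. The value is
    rounded down to two decimals so the reserved part never drops below reserved_mb.
    """
    if total_flink_memory_mb <= reserved_mb:
        raise ValueError("total memory must be larger than the reserved amount")
    exact = (total_flink_memory_mb - reserved_mb) / total_flink_memory_mb
    return math.floor(exact * 100) / 100


if __name__ == "__main__":
    fraction = managed_fraction(4096)
    # Line that would go into flink-conf.yaml under the assumptions above.
    print(f"taskmanager.memory.managed.fraction: {fraction}")
```

For a 4096MB TaskManager this yields 0.87, i.e. roughly 3560MB for RocksDB and a little over 500MB for everything else.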
Batch
Batch programs do not use a state backend. Managed memory is used for off-heap joins, sorting, etc. The memory configurations like taskmanager.memory.managed.fraction are the same for batch and streaming.
As per the Flink documentation, memory management is handled differently in streaming and batch.
I have been reading the Flink docs and I need a few clarifications. Hopefully someone can help me out here.
State Backend - This basically refers to the location where the data for my operations will be stored. For example, if I'm doing an aggregation on a 2 hr window, where will the buffered data be stored? As pointed out in the docs, for large state we should use RocksDB.
The RocksDBStateBackend holds in-flight data in a RocksDB database that is (per default) stored in the TaskManager data directories
Does in-flight data here refer to the incoming data, say from a Kafka stream, that has not yet been checkpointed?
Upon checkpointing, the whole RocksDB database will be checkpointed into the configured file system and directory. Minimal metadata is stored in the JobManager’s memory
When using RocksDB, when a checkpoint is created, the entire buffered data is stored on disk. Then, say when the window is triggered at the end of 2 hr, will this state which was stored on disk be deserialised and used for the operation?
Note that the amount of state that you can keep is only limited by the amount of disk space available
Does this mean that I could run an analytical query on a potentially high-throughput stream with very limited resources? Suppose my Kafka stream has a rate of 50k messages/sec; I could then run it on a single core on my EMR cluster, and the tradeoff would be that Flink cannot keep up with the incoming rate and lags behind, but given enough disk space it won't hit an OOM error?
When a checkpoint is completed, I assume that the checkpoint metadata (like the HDFS or S3 path from each TM) from all the TMs will be sent to the JM? In case of TM failure, the JM will spin up a new JM and restore the state from the last checkpoint.
The default setting for JM in flink-conf.yaml - jobmanager.heap.size: 1024m.
My confusion here is why the JM needs 1GB of heap memory. What does a JM handle apart from synchronisation among TMs? How do I decide how much memory should be configured for the JM in production?
Can someone verify that my understanding is correct or not and point me in the correct direction. Thanks in advance!
Overall your understanding appears to be correct. One point: in the case of a TM failure, the JM will spin up a new TM and restore the state from the last checkpoint (rather than spinning up a new JM).
But to be a bit more precise, in the last few releases of Flink, what used to be a monolithic job manager has been refactored into separate components: a dispatcher that receives jobs from clients and starts new job managers as needed; a job manager that only is concerned with providing services to a single job; and a resource manager that starts up new TMs as needed. The resource manager is the only component that is cluster framework specific -- e.g., there is a YARN resource manager.
The job manager has other roles as well -- it is the checkpoint coordinator and the API endpoint for the web UI and metrics.
How much heap the JM needs is somewhat variable. The defaults were chosen to try to cover more than a narrow set of situations, and to work out of the box. Also, by default, checkpoints go to the JM heap, so it needs some space for that. If you have a small cluster and are checkpointing to a distributed filesystem, you should be able to get by with less than 1GB.
We are using Flink streaming to run a few jobs on a single cluster. Our jobs are using rocksDB to hold a state.
The cluster is configured to run with a single JobManager and 3 TaskManagers on 3 separate VMs.
Each TM is configured to run with 14GB of RAM.
JM is configured to run with 1GB.
We are experiencing 2 memory related issues:
- When running the TaskManager with an 8GB heap allocation, the TM ran out of heap memory and we got a heap out-of-memory exception. Our solution to this problem was to increase the heap size to 14GB. This configuration seems to have solved the issue, as we no longer crash due to running out of heap memory.
- Still, after increasing the heap size to 14GB (per TM process), the OS runs out of memory and kills the TM process. RES memory rises over time, reaching ~20GB per TM process.
1. How can we predict the maximum total amount of physical memory needed, and the heap size to configure?
2. Given our memory issues, is it reasonable to use non-default values for Flink managed memory? What would be the guideline in that case?
Further details:
Each VM is configured with 4 CPUs and 24GB of RAM
Using Flink version: 1.3.2
The total amount of required physical and heap memory is quite difficult to compute since it strongly depends on your user code, your job's topology and which state backend you use.
As a rule of thumb, if you experience OOM and are still using the FsStateBackend or the MemoryStateBackend, then you should switch to the RocksDBStateBackend, because it can gracefully spill to disk if the state grows too big.
If you are still experiencing OOM exceptions as you have described, then you should check whether your user code keeps references to state objects or otherwise generates large objects that cannot be garbage collected. If this is the case, then you should try to refactor your code to rely on Flink's state abstraction, because with RocksDB it can go out of core.
RocksDB itself needs native memory which adds to Flink's memory footprint. This depends on the block cache size, indexes, bloom filters and memtables. You can find out more about these things and how to configure them here.
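A back-of-the-envelope estimate of that native memory can be sketched as follows. This is an approximation under stated assumptions, not a formula taken from Flink: it assumes every registered state gets its own memtables and its own block cache, uses what I believe are the stock RocksDB defaults (64MB write buffer, 2 write buffers, 8MB block cache), and ignores indexes and bloom filters, which add to the total. The function name is hypothetical.

```python
def rocksdb_native_estimate_mb(
    num_states: int,
    write_buffer_mb: int = 64,    # assumed RocksDB default memtable size
    max_write_buffers: int = 2,   # assumed RocksDB default number of memtables
    block_cache_mb: int = 8,      # assumed RocksDB default block cache size
) -> int:
    """Rough lower bound for RocksDB native memory on one TaskManager.

    num_states is the number of keyed states across all operator subtasks
    running on that TaskManager. Indexes and bloom filters are not included.
    """
    if num_states < 0:
        raise ValueError("num_states must not be negative")
    per_state_mb = write_buffer_mb * max_write_buffers + block_cache_mb
    return num_states * per_state_mb


if __name__ == "__main__":
    # Hypothetical job: 10 keyed states on this TaskManager.
    print(rocksdb_native_estimate_mb(10), "MB on top of the JVM heap")
```

With 10 states the estimate is already 1360MB of native memory on top of the configured heap, which helps explain why RES can exceed the heap size by a wide margin.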
Last but not least, you should not activate taskmanager.memory.preallocate when running streaming jobs, because streaming jobs currently don't use managed memory. By activating preallocation, you would allocate memory for Flink's managed memory, which reduces the available heap space.
Using RocksDBStateBackend can lead to significant off-heap/direct memory consumption, up to the available memory on the host. Normally that doesn't cause a problem, when the task manager process is the only big memory consumer. However, if there are other processes with dynamically changing memory allocations, it can lead to out of memory. I came across this post since I'm looking for a way to cap the RocksDBStateBackend memory usage. As of Flink 1.5, there are alternative option sets available here. It appears though that these can only be activated programmatically, not via flink-conf.yaml.