We are using Flink streaming to run a few jobs on a single cluster. Our jobs use RocksDB to hold state.
The cluster is configured to run with a single JobManager and 3 TaskManagers on 3 separate VMs.
Each TM is configured to run with 14GB of RAM.
JM is configured to run with 1GB.
We are experiencing 2 memory-related issues:
- When running the TaskManager with an 8GB heap, the TM ran out of heap memory and threw a heap out-of-memory exception. We increased the heap size to 14GB, which seems to have solved the issue: we no longer crash from heap exhaustion.
- However, with the 14GB heap (per TM process), the OS now runs out of memory and kills the TM process. RES memory rises over time and reaches ~20GB per TM process.
1. How can we predict the maximum total physical memory required and choose the heap size accordingly?
2. Given our memory issues, is it reasonable to use non-default values for Flink managed memory? What would the guideline be in that case?
Further details:
Each VM is configured with 4 CPUs and 24GB of RAM
Using Flink version: 1.3.2
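The numbers in the question can be put into a rough budget. This is only a back-of-the-envelope sketch: the function names and the 1 GiB reserved for the operating system are assumptions made for illustration, not Flink settings.

```python
def native_headroom_gib(vm_ram_gib, heap_gib, os_reserve_gib=1.0):
    """Memory left on the VM for everything outside the JVM heap
    (RocksDB, metaspace, thread stacks, direct buffers).
    os_reserve_gib is an assumed allowance for the OS, not a measured value."""
    return vm_ram_gib - heap_gib - os_reserve_gib


def min_off_heap_gib(rss_gib, heap_gib):
    """Lower bound on non-heap memory in use, assuming the whole heap is
    resident. If only part of the heap is resident, the off-heap share
    is larger than this."""
    return rss_gib - heap_gib


# Figures from the question: 24GB VM, 14GB heap, ~20GB RES per TM process.
print("headroom outside heap (GiB):", native_headroom_gib(24, 14))
print("off-heap in use, at least (GiB):", min_off_heap_gib(20, 14))
```

With a 14GB heap on a 24GB VM, native memory has a limited budget, and the observed RES shows that at least 6GB of it is already in use outside the heap.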
The total amount of required physical and heap memory is quite difficult to compute since it strongly depends on your user code, your job's topology and which state backend you use.
As a rule of thumb, if you experience OOM and are still using the FileSystemStateBackend or the MemoryStateBackend, then you should switch to RocksDBStateBackend, because it can gracefully spill to disk if the state grows too big.
If you are still experiencing OOM exceptions as you have described, then you should check whether your user code keeps references to state objects or otherwise generates large objects which cannot be garbage collected. If this is the case, then you should try to refactor your code to rely on Flink's state abstraction, because with RocksDB the state can go out of core.
RocksDB itself needs native memory which adds to Flink's memory footprint. This depends on the block cache size, indexes, bloom filters and memtables. You can find out more about these things and how to configure them here.
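As a rough illustration of how these components add up, the sketch below estimates memtable and block cache memory only. It assumes one RocksDB instance per stateful operator subtask, one column family per registered state, and a separate block cache per column family. The function and parameter names are hypothetical, and indexes and bloom filters are deliberately left out, so treat the result as a floor rather than a prediction.

```python
def rocksdb_native_estimate_mb(instances, states_per_instance,
                               write_buffer_mb, max_write_buffers,
                               block_cache_mb):
    """Estimate native memory from memtables and block caches only.

    Assumptions (not taken from the original answer):
    - each state is its own column family with its own memtables,
    - each column family has its own block cache (count the cache once
      per instance instead if your setup shares it),
    - indexes and bloom filters are NOT included.
    """
    column_families = instances * states_per_instance
    memtables_mb = column_families * write_buffer_mb * max_write_buffers
    block_caches_mb = column_families * block_cache_mb
    return memtables_mb + block_caches_mb


# Hypothetical job: 4 stateful subtasks per TM, 2 states each,
# 64MB write buffers (up to 2 per column family), 8MB block cache.
print(rocksdb_native_estimate_mb(4, 2, 64, 2, 8), "MB")
```

The estimate grows linearly with the number of states and parallel subtasks, which is one reason native memory can keep rising as jobs are added to the same cluster.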
Last but not least, you should not activate taskmanager.memory.preallocate when running streaming jobs, because streaming jobs currently don't use managed memory. Thus, by activating preallocation, you would allocate memory for Flink's managed memory, which reduces the available heap space.
Using RocksDBStateBackend can lead to significant off-heap/direct memory consumption, up to the available memory on the host. Normally that doesn't cause a problem, when the task manager process is the only big memory consumer. However, if there are other processes with dynamically changing memory allocations, it can lead to out of memory. I came across this post since I'm looking for a way to cap the RocksDBStateBackend memory usage. As of Flink 1.5, there are alternative option sets available here. It appears though that these can only be activated programmatically, not via flink-conf.yaml.
Related
I am trying to understand what is going on in a basic Flink streaming job.
I have a basic Flink job which reads strings from Kafka, performs some string operations in a flatMap, and sends the messages back to Kafka as strings. No windows or state.
1 Kafka source and 8 Kafka sinks.
Message count is about 25K/second.
When I check the task manager in the Flink UI, I see that heap memory usage sometimes goes up to 10GB. When I watch it, it fluctuates between 3GB and 10GB.
When I configure the heap to 500MB, the job seems to work without any problem, and heap usage is around 200-400MB.
If I configure the heap to 10GB, the task manager's heap usage metric in the Flink UI varies between 3GB and 10GB.
I took a heap dump of the task manager and analyzed it with VisualVM. It shows a heap usage of only 23MB. Why?
I enabled GC logs; here is a sample:
Memory usage stats: [HEAP: 3851/9208/9208 MB, NON HEAP: 84/89/744 MB (used/committed/max)]
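To compare these figures over time, the log line can be split into its used/committed/max values. The parser below is a minimal sketch that assumes the line always has exactly the layout shown above; the function name and the returned dictionary shape are made up for illustration.

```python
import re

# Matches the layout of the sample line above; other layouts are not handled.
STATS_RE = re.compile(
    r"HEAP: (\d+)/(\d+)/(\d+) MB, NON HEAP: (\d+)/(\d+)/(\d+) MB"
)
FIELDS = ("used", "committed", "max")


def parse_memory_stats(line):
    """Return heap and non-heap figures in MB as nested dictionaries."""
    match = STATS_RE.search(line)
    if match is None:
        raise ValueError("unrecognized memory stats line: %r" % line)
    values = [int(v) for v in match.groups()]
    return {
        "heap": dict(zip(FIELDS, values[:3])),
        "non_heap": dict(zip(FIELDS, values[3:])),
    }


sample = ("Memory usage stats: [HEAP: 3851/9208/9208 MB, "
          "NON HEAP: 84/89/744 MB (used/committed/max)]")
stats = parse_memory_stats(sample)
print(stats["heap"])
```

Note that "used" counts everything currently allocated on the heap, including objects that are already unreachable but not yet collected, so it can be far larger than the live set reported by a heap dump analysis.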
So where do these 5-6GB go? Why is Flink using this much memory for just a basic streaming job?
The real problem is that when I deploy the other jobs, the task manager crashes after a while without any trace, so I assume I have an OOM error.
I have an AWS RDS Aurora (PostgreSQL compatible) instance, which recently triggered an alert because of increased swap usage, caused by running some unoptimized queries (big temporary tables and sequential scans). Some basic AWS metrics look like this:
Blue line: freeable memory
Purple line: swap usage
Yellow line: freeable - swap
I have a few questions that I could not find answers to in the AWS docs, in forums, or on SO:
Why did the DB start allocating swap while it still had a lot of freeable memory?
Why is it not releasing the swap if it's no longer used? How can I reduce the amount of used swap?
Why also adds to freeable memory?
You can find more details about the RDS swap memory in the AWS Knowledgebase: https://aws.amazon.com/premiumsupport/knowledge-center/troubleshoot-rds-swap-memory/.
Swap memory is an essential part of the OS, which extends the available memory by storing additional data on disk. When more memory is allocated, old contents of RAM are written to the swap location on disk and new contents are placed in RAM. In this case, it indicates that a new query or set of queries was executed that fetched or scanned more records, so more RAM was required, and the OS made room by moving some old data to swap.
As per the KB, this is the reason swap usage does not go down:
Linux swap usage isn't cleared frequently, because clearing the swap
usage requires extra overhead to reallocate swap when it's needed and
when reloading pages. As a result, if swap space is used on your RDS
DB instance, even if swap space was used only one time, the SwapUsage
metrics don't return to zero.
Postgres caches the results of previous executions in RAM so that it can reduce disk seeks the next time. You can improve database performance by allocating a sufficient buffer cache. This is expected behavior. The size of this cache is configurable. Please refer to: https://redfin.engineering/how-to-boost-postgresql-cache-performance-8db383dc2d8f
Also, as mentioned in the KB, this could be due to queries that return a huge number of records, or to load on the database. You can enable Performance Insights to get more details about the queries that were running during that time.
BTW, Performance Insights may not be available on smaller RDS instances. In that case, you can look into the binary logs to see which queries were executed. Enabling slow query logs will also help.
I have gone through a lot of blogs and Stack Overflow answers, but I am still not clear about Flink memory management. In a few blogs I found "Memory Manager Pool" and "RocksDB". I am using RocksDB and I assume all my state is stored in that DB.
Here are my doubts:
How is memory management handled in streaming?
What is the difference between memory management in streaming and batch?
What is the difference between the "Memory Manager Pool" and the state backend (RocksDB)?
In streaming, what is meant by "Flink Managed Memory"? Does it include the memory required by the RocksDB cache and buffers?
Streaming
When you use RocksDBStateBackend, all KeyedState (ValueState, MapState, ... and Timers) is stored in RocksDB. OperatorState is kept on the heap. OperatorState is usually very small and seldom used directly by a Flink developer.
For Flink 1.10+, managed memory includes all memory used by RocksDB. Flink makes sure that RocksDB's memory usage stays within the limits of the assigned managed memory. Use taskmanager.memory.managed.fraction to tune how much memory you give to RocksDB. Usually, you can give all memory but 500MB to RocksDB.
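To turn the "all but 500MB" rule of thumb into a value for taskmanager.memory.managed.fraction, a small calculation helps. This sketch takes the rule literally and assumes the fraction is applied to the total Flink memory of the TaskManager. The helper name is hypothetical, and the result should still be checked against the other memory components Flink needs (heap, network buffers).

```python
def managed_fraction_for_reserve(total_flink_memory_mb, reserve_mb=500):
    """Fraction of total Flink memory that leaves reserve_mb for
    everything that is not managed memory.

    Assumption: the fraction is applied to total Flink memory; the
    500MB default mirrors the rule of thumb in the answer above.
    """
    if total_flink_memory_mb <= reserve_mb:
        raise ValueError("total memory must exceed the reserve")
    return (total_flink_memory_mb - reserve_mb) / total_flink_memory_mb


# Hypothetical TaskManager with 4096MB of total Flink memory.
fraction = managed_fraction_for_reserve(4096)
print("taskmanager.memory.managed.fraction: %.2f" % fraction)
```

Rounding the result down rather than up keeps the reserve at or above the intended size.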
Batch
Batch Programs do not use a Statebackend. Managed memory is used for off-heap joins, sorting, etc. The memory configurations like taskmanager.memory.managed.fraction are the same for batch and streaming.
As per the Flink documentation, memory management is handled differently in streaming and batch.
We are using AppDynamics and VisualVM to monitor our application's heap memory usage. We see a graph similar to the ones in these questions - this and this.
The red boxes show idle-system heap usage: peaks are seen only when the system is idle, and are observed even when no application is deployed.
The green arrow points to the period when the application is actually in use: while the system is in use, we see relatively low heap usage being reported.
Based on the clarifications in other SO questions, if we say this is due to garbage collection, why would GC not occur while the application is in use? When the system is idle, we see system objects like java.lang.String, byte[], int[] etc. reported in AppDynamics, but how do we find out who is responsible for creating them?
Again, in the heap dumps taken during the idle state, we see only 200MB out of 500MB of memory used, even though the server is configured with -Xmx4g.
How should we make sense of these observations?
On analyzing the heap dump taken during system idle state, we only see various WebAppClassLoaders holding instances of different library classes.
This pattern is also explained in official blogs of APM experts like Plumbr and Datadog as a sign of healthy JVM where regular GC activity is occurring and they explain that it means none of the objects will stay in memory forever.
From Plumbr blog:
Seeing the following pattern is a confirmation that the JVM at question is definitely not leaking memory.
The reason for the double-sawtooth pattern is that the JVM needs to allocate memory on the heap as new objects are created as a part of the normal program execution. Most of these objects are short-lived and quickly become garbage. These short-lived objects are collected by a collector called “Minor GC” and represent the small drops on the sawteeth.
I am measuring memory usage for an application (WordCount) in Flink with ps -p TaskManagerPID -o rss. However, the results don't make any sense, because the same amount of memory is used for every amount of data (1MB, 10MB, 100MB, 1GB, 10GB). For 10GB of data, the measured result is even less than 10GB. Is the TaskManager the wrong process for measuring memory usage? Which process of the Flink process model is responsible for memory allocation?
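For reference, the measurement described above can be scripted so that it is easy to repeat at intervals while a job runs. This is a minimal sketch assuming a ps that reports rss in kilobytes (as on Linux); the function names are hypothetical and the PID must be supplied by the caller.

```python
import subprocess


def parse_rss_kb(ps_output):
    """Extract the RSS value from ps output, with or without a header line."""
    lines = ps_output.strip().splitlines()
    return int(lines[-1].strip())


def rss_mb(pid):
    """Resident set size of a process in MB, read via ps.
    'rss=' asks ps to omit the column header."""
    result = subprocess.run(
        ["ps", "-p", str(pid), "-o", "rss="],
        capture_output=True, text=True, check=True,
    )
    return parse_rss_kb(result.stdout) / 1024
```

Calling rss_mb with the TaskManager PID returns the same figure as the ps command in the question, converted to MB. RSS only reflects memory the process currently holds, so it will not track input size if the process works within a fixed, pre-sized budget.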
Flink features two processing modes, stream and batch processing.
Stream Processing:
In stream processing, Flink uses pluggable state backends to maintain the state of an application. In Flink version 1.5.0, there are two types of state backends: 1) backends (FsStateBackend and MemoryStateBackend) that store the application state on the heap of the worker (TaskManager) JVM process and 2) the RocksDBStateBackend, which stores the state in RocksDB on disk. In both cases, you can monitor the memory consumption using regular JVM memory monitoring tools. However, for the RocksDBStateBackend most of the state will be stored on disk.
Batch Processing
The internal processing algorithms (sorting, hash tables) of the batch processing operators work with managed memory that is (typically) pre-allocated when a worker process (TaskManager) starts and is never returned. Flink assigns this managed memory to its algorithms, and the algorithms spill to disk if the amount of data exceeds their memory budget. Since all memory is pre-allocated and internally managed by Flink, it is not possible to measure the actual memory consumption from outside the process.