How to make a Flink job with huge state finish

We are running a Flink cluster to process terabytes of historic streaming data. The computation keeps a huge amount of keyed state - Value and Map states with the RocksDB backend. At some point in the job the performance starts degrading, and the input and output rates drop to almost 0. At that point exceptions like "Communication with TaskManager X timeout error" can be seen in the logs, but the job is compromised even before that.
I presume the problem we are facing has to do with RocksDB's disk backend. As the state of the job grows, RocksDB needs to access the disk more often, which drags the performance down to 0. We have played with some of the options and set those that make sense for our particular setup:
We are using the SPINNING_DISK_OPTIMIZED_HIGH_MEM predefined profile, further tuned with optimizeFiltersForHits and some other options, which has somewhat improved performance. However, none of this provides a stable computation, and on a re-run against a bigger data set the job halts again.
What we are looking for is a way to modify the job so that it keeps progressing at SOME speed even as the input and the state grow. We are running on AWS with a limit of around 15 GB per TaskManager and no limit on disk space.

Using SPINNING_DISK_OPTIMIZED_HIGH_MEM costs a lot of off-heap memory for RocksDB's memtables. Since you are running the job with a memory limit of around 15 GB, I think you will hit OOM issues with it; but if you choose the default predefined profile instead, you will face write stalls or CPU overhead from decompressing RocksDB's block cache pages. So I think you should increase the memory limit.
Here are some posts about RocksDB, FYI:
https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
https://www.ververica.com/blog/manage-rocksdb-memory-size-apache-flink
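To make this concrete, here is a rough sketch of what raising the TaskManager memory and letting Flink manage RocksDB's share could look like in flink-conf.yaml (assuming Flink 1.10+; the values are placeholders, not a recommendation - you would have to tune them for your workload):

```yaml
# flink-conf.yaml -- illustrative values only
taskmanager.memory.process.size: 24g        # raise the overall TM limit beyond 15 GB
taskmanager.memory.managed.fraction: 0.5    # share of Flink memory handed to RocksDB
state.backend.rocksdb.memory.managed: true  # keep RocksDB within the managed budget
```

With memory.managed set to true, Flink caps RocksDB's block cache and memtables at the managed memory budget, which avoids the unbounded off-heap growth that the predefined profiles can otherwise cause.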

Related

High Flink network buffer usage, which causes Kafka lagging

Our Flink job contains a filter, a keyBy on session id, and then a session window with a 30-minute gap. The session window needs to accumulate all the events of the session and process them using a ProcessWindowFunction.
We are using Flink 1.9, 128 containers with 20G memory in total to run our job and the cut-off ratio is 0.3.
We are doing incremental checkpoints.
When the session windows start to trigger the process function, the network buffer usage gets pretty high, and then the Kafka input starts lagging.
Our setting:
state.backend: rocksdb
state.checkpoints.dir: hdfs://nameservice0/service
state.backend.rocksdb.memory.managed: true
state.backend.incremental: true
#https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
state.backend.rocksdb.memory.write-buffer-ratio: 0.6
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1
state.backend.rocksdb.block.blocksize: 16mb
state.backend.rocksdb.writebuffer.count: 8
state.backend.rocksdb.writebuffer.size: 256mb
state.backend.rocksdb.timer-service.factory: heap
containerized.heap-cutoff-ratio: 0.25
taskmanager.network.memory.fraction: 0.85
taskmanager.network.memory.min: 512mb
taskmanager.network.memory.max: 7168mb
taskmanager.network.memory.buffers-per-channel: 8
taskmanager.memory.segment-size: 4mb
taskmanager.network.memory.floating-buffers-per-gate: 16
taskmanager.network.netty.transport: poll
Some of the graphs: (images omitted)
Any suggestion will be appreciated!
If I had access to the details, here's what I would look at to try to improve performance for this application:
(1) Could the windows be re-implemented to do incremental aggregation? Currently the windows are building up what may be rather long lists of events, and they only work through those lists when the sessions end. This apparently takes long enough to cause backpressure on Kafka. If you can pre-aggregate the session results, this will even out the processing, and the problem should go away.
And no, I'm not contradicting what I said here. If I haven't been clear, let me know.
(2) You've put a lot of extra network buffering in place. This is usually counter-productive; you want the backpressure to ripple back quickly and throttle the source, rather than pushing more data into Flink's network buffers.
You would do better to reduce the network buffering, and if possible, use your available resources to provide more slots instead. Having more slots will reduce the impact when one slot is busy working through the contents of a long session that just ended. Giving more memory to RocksDB might help too.
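As a sketch, pulling the network buffering back toward Flink's defaults might look like the following (the slot count is an assumption; how many slots your hardware supports depends on cores and memory):

```yaml
# Pull network memory back toward the defaults so backpressure
# reaches the source quickly instead of filling buffers
taskmanager.network.memory.fraction: 0.1
taskmanager.network.memory.max: 1gb
taskmanager.network.memory.buffers-per-channel: 2
# Spend the freed memory on more slots per TM instead (illustrative value)
taskmanager.numberOfTaskSlots: 4
```

The values 0.1, 1gb, and 2 are the defaults for these settings, so this effectively removes the extra buffering the question's configuration added.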
(3) See if you can optimize serialization. There can be a 10x difference in throughput between the best and worst serializers. See Flink Serialization Tuning. If there are any fields in the records that you don't actually need, drop them.
(4) Look at tuning RocksDB. Make sure you are using the fastest available local disks for RocksDB, such as local SSDs. Avoid using network attached storage (such as EBS) for state.backend.rocksdb.localdir.
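For example, pointing RocksDB's working directory at a local SSD is a one-line change (the path here is just an illustration):

```yaml
# Working state lives here; use a fast local disk, not EBS/NAS
state.backend.rocksdb.localdir: /mnt/local-ssd/rocksdb
```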
I do not know Flink's internals, but the reason could be related to the session window.
What I mean is: if you have many sessions with the same gap (30 minutes), many session windows may fire at the same time, which can create a delay.

Checkpointing issues in Flink 1.10.1 using RocksDB state backend

We are experiencing a very difficult-to-observe problem with our Flink job.
The Job is reasonably simple, it:
Reads messages from Kinesis using the Flink Kinesis connector
Keys the messages and distributes them to ~30 different CEP operators, plus a couple of custom WindowFunctions
The messages emitted from the CEP/window operators are forwarded to a SinkFunction that writes them to SQS
We are running Flink 1.10.1 on Fargate, using 2 containers with 4 vCPUs/8 GB each, and we are using the RocksDB state backend with the following configuration:
state.backend: rocksdb
state.backend.async: true
state.backend.incremental: false
state.backend.rocksdb.localdir: /opt/flink/rocksdb
state.backend.rocksdb.ttl.compaction.filter.enabled: true
state.backend.rocksdb.files.open: 130048
The job runs with a parallelism of 8.
When the job starts from cold, it uses very little CPU and checkpoints complete in 2 seconds. Over time, the checkpoint sizes increase but the times are still a very reasonable couple of seconds:
During this time we can observe the CPU usage of our TaskManagers gently growing for some reason:
Eventually, the checkpoint time will start spiking to a few minutes, and then will just start repeatedly timing out (10 minutes). At this time:
Checkpoint size (when it does complete) is around 60MB
CPU usage is high, but not 100% (usually around 60-80%)
Looking at in-progress checkpoints, usually 95%+ of operators complete the checkpoint within 30 seconds, but a handful will just stick and never complete. The SQS sink is always among these, even though the SinkFunction is not rich and has no state.
Using the backpressure monitor on these operators reports a HIGH backpressure
Eventually this situation resolves in one of two ways:
Enough checkpoints fail that the job fails due to the failed-checkpoint tolerance threshold
The checkpoints eventually start succeeding, but never go back down to the 5-10s they take initially (when the state size is more like 30MB vs. 60MB)
We are really at a loss as to how to debug this. Our state seems very small compared to the kind of state you see in some questions on here. Our volumes are also pretty low; we are very often under 100 records/sec.
We'd really appreciate any input on areas we could look into to debug this.
Thanks,
A few points:
It's not unusual for state to gradually grow over time. Perhaps your key space is growing and you are keeping some state for each key. If you are relying on state TTL to expire stale state, perhaps it is not configured in a way that allows it to clean up expired state as quickly as you would expect. It's also relatively easy to inadvertently create CEP patterns that need to keep some state for a very long time before certain possible matches can be ruled out.
A good next step would be to identify the cause of the backpressure. The most common cause is that a job doesn't have adequate resources. Most jobs gradually come to need more resources over time, as the number of users (for example) being managed rises. For example, you might need to increase the parallelism, or give the instances more memory, or increase the capacity of the sink(s) (or the speed of the network to the sink(s)), or give RocksDB faster disks.
Besides being inadequately provisioned, other causes of backpressure include
blocking i/o is being done in a user function
a large number of timers are firing simultaneously
event time skew between different sources is causing large amounts of state to be buffered
data skew (a hot key) is overwhelming one subtask or slot
lengthy GC pauses
contention for critical resources (e.g., using a NAS as the local disk for RocksDB)
Enabling RocksDB native metrics might provide some insight.
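For example, a few of the RocksDB native metrics can be switched on in flink-conf.yaml (these metrics add some overhead of their own, so enable them selectively; the selection below is just one reasonable starting set):

```yaml
state.backend.rocksdb.metrics.estimate-num-keys: true
state.backend.rocksdb.metrics.estimate-live-data-size: true
state.backend.rocksdb.metrics.mem-table-flush-pending: true
state.backend.rocksdb.metrics.num-running-compactions: true
state.backend.rocksdb.metrics.actual-delayed-write-rate: true
```

A non-zero actual-delayed-write-rate in particular would indicate RocksDB is throttling writes (a write stall), which would line up with the checkpoint times spiking.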
Add this property to your configuration:
state.backend.rocksdb.checkpoint.transfer.thread.num: {threadNumberAccordingYourProjectSize}
If you do not set this, it defaults to 1.
Link: https://github.com/apache/flink/blob/master/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBOptions.java#L62

Having consumer issues when RocksDB in flink

I have a job that consumes from RabbitMQ. I was using the FS state backend, but the state sizes grew, so I decided to move my state to RocksDB.
The issue is that for the first hours the job runs fine, and even after more time if traffic slows down, but when the traffic gets high again the consumer starts to have issues (events piled up as unacked) and these issues are then reflected in the rest of the app.
I have:
4 CPU core
Local disk
16GB RAM
Unix environment
Flink 1.11
Scala version 2.11
1 single job running with few keyedStreams, and around 10 transformations, and sink to Postgres
some configurations
flink.buffer_timeout=50
flink.maxparallelism=4
flink.memory=16
flink.cpu.cores=4
#checkpoints
flink.checkpointing_compression=true
flink.checkpointing_min_pause=30000
flink.checkpointing_timeout=120000
flink.checkpointing_enabled=true
flink.checkpointing_time=60000
flink.max_current_checkpoint=1
#RocksDB configuration
state.backend.rocksdb.localdir=home/username/checkpoints (this is not working, I don't know why)
state.backend.rocksdb.thread.numfactory=4
state.backend.rocksdb.block.blocksize=16kb
state.backend.rocksdb.block.cache-size=512mb
#rocksdb or heap
state.backend.rocksdb.timer-service.factory=heap (I have tested with rocksdb too and it is the same)
state.backend.rocksdb.predefined-options=SPINNING_DISK_OPTIMIZED
Let me know if more information is needed.
state.backend.rocksdb.localdir should be an absolute path, not a relative one. And this setting isn't for specifying where checkpoints go (which shouldn't be on the local disk), this setting is for specifying where the working state is kept (which should be on the local disk).
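For example (the corrected leading slash is the point here; the rest of the path is just an illustration):

```
# Absolute path on a local disk for RocksDB's working state,
# not a checkpoint destination
state.backend.rocksdb.localdir=/home/username/rocksdb-working-dir
```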
Your job is experiencing backpressure, meaning that some part of the pipeline can't keep up. The most common causes of backpressure are (1) sinks that can't keep up, and (2) inadequate resources (e.g., the parallelism is too low).
You can test if postgres is the problem by running the job with a discarding sink.
Looking at various metrics should give you an idea of what resources might be under-provisioned.

Flink checkpoints causes backpressure

I have a Flink job processing data at around 200k qps. Without checkpoints, the job is running fine.
But when I tried to add checkpoints (with a 50-minute interval), it causes backpressure at the first task, which adds a key field to each entry, and the data lag goes up constantly as well.
(Graph: the lag of my two Kafka topics. The first half, with checkpoints enabled, shows lag going up very quickly; the second part, with checkpoints disabled, shows very low lag, within milliseconds.)
I am using at-least-once checkpoint mode, which should be an asynchronous process. Could anyone suggest anything?
My checkpointing setting
env.enableCheckpointing(1800000, CheckpointingMode.AT_LEAST_ONCE);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.getCheckpointConfig().setCheckpointTimeout(600000); // 10 minutes
env.getCheckpointConfig().setFailOnCheckpointingErrors(
        jobConfiguration.getCheckpointConfig().getFailOnCheckpointingErrors());
my job has 128 containers.
With a 10-minute checkpoint time, the stats are as follows: (graph omitted)
I am trying a 30-minute checkpoint to see whether that helps.
I was trying to tune the memory usage, but it does not seem to be working.
But in the task manager, it's still the same: (graph omitted)
TL;DR: it's sometimes hard to analyse the problem. I have two lucky guesses/shots. First, if you are using the RocksDB state backend, you could switch to FsStateBackend; it's usually faster, and RocksDB makes the most sense for large state sizes that do not fit into memory (or if you really need the incremental checkpointing feature). Second, fiddle with the parallelism, either increasing or decreasing it.
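Switching backends is a small config change plus a restart (note that at the Flink versions discussed here, you cannot switch backends while restoring from a RocksDB savepoint, so this generally means starting fresh). A sketch, with an example checkpoint path:

```yaml
# Heap-based backend: faster when state fits in memory,
# but no incremental checkpoints
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/checkpoints   # example path
```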
I would suspect the same thing that @ArvidHeise wrote. Your checkpoint size is not huge, but it's also not trivial. It can add extra overhead that pushes the job over the threshold from barely keeping up with the traffic to not keeping up and causing backpressure. If you are under backpressure, the latency will just keep accumulating, so even a couple of percent of extra overhead can make the difference between an end-to-end latency of milliseconds and an unbounded, ever-growing value.
If you cannot simply add more resources, you will have to analyse what exactly is adding this extra overhead and which resource is the bottleneck.
Is it CPU? Check CPU usage on the cluster. If it's ~100%, that's the thing you need to optimise for.
Is it IO? Check IO usage on the cluster and compare it against the maximal throughput/number of requests per second that you can achieve.
If both CPU and IO usage are low, you might want to try to increase the parallelism, but...
Keep in mind data skew. Backpressure could be caused by a single task and in that case it makes it hard to analyse the problem, as it will be a single bottlenecked thread (on either IO or CPU), not whole machine.
After figuring out what resource is the bottleneck, next question would be why? It might be immediately obvious once you see it, or it might require digging in, like checking GC logs, attaching profiler etc.
Answering those questions could either give you information what you could try to optimise in your job or allow you to tweak configuration or could give us (Flink developers) an extra data point what we could try to optimise on the Flink side.
Any kind of checkpointing adds computation overhead. Most of the checkpointing is done asynchronously (as you have stated), but it still adds general I/O operations. These additional I/O requests may, for example, congest your access to external systems. Also, if you enable checkpointing, Flink needs to keep track of more information (new vs. already checkpointed).
Have you tried to add more resources to your job? Could you share your whole checkpointing configuration?

Flink Capacity Planning For Large State in YARN Cluster

We have hit a roadblock moving an app to production scale and were hoping to get some guidance. The application is a pretty common stream-processing use case, but it does require maintaining a large number of keyed states. We are processing 2 streams - one is a daily burst (normally around 50 million events, but it could go up to 100 million in a one-hour burst) and the other is a constant stream of around 70-80 million events per hour. We are doing a low-level join using a CoProcessFunction between the two keyed streams. The CoProcessFunction needs to refresh (upsert) state from the daily burst stream and decorate the constantly streaming data with values from the state built from the bursty stream.

All of the logic is working pretty well in a standalone dev environment. We are throwing about 500k events of bursty traffic for state and about 2-3 million events from the data stream at it. We have 1 TM with 16 GB memory, 1 JM with 8 GB memory, and 16 slots (1 per core) on the server. We have been taking savepoints in case we need to restart the app for code changes etc., and the app does seem to recover from state very well. Based on the savepoints, the total volume of state in the production flow should be around 25-30 GB.
At this point, however, we are trying to deploy the app at production scale. The app also has a flag that can be set at startup time to ignore the data stream, so we can simply initialize state. So basically we are trying to see if we can initialize the state first and take a savepoint as a test. At this point we are using 10 TMs with 4 slots and 8 GB memory each (the idea was to allocate around 3 times the estimated state size to start with), but the TMs keep getting killed by YARN with a GC Overhead Limit Exceeded error.

We have gone through quite a few blogs/docs on Flink managed memory, off-heap vs. heap memory, disk spill-over, state backends etc. We did try to tweak the managed-memory configs in multiple ways (off/on heap, fraction, network buffers etc.) but can't seem to find a good way to fine-tune the app to avoid these issues. Ideally, we would hold state in memory (we do have enough capacity in the production environment for it) for performance reasons and spill over to disk (which I believe Flink should provide out of the box?). It feels like 3x the anticipated state volume in cluster memory should have been enough to just initialize the state. So instead of just continuing to increase memory (which may or may not help, as the error is about GC overhead), we wanted to get some input from experts on best practices and how to plan this application better.
Appreciate your input in advance!
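For reference, the managed-memory knobs mentioned above look like this in flink-conf.yaml (assuming Flink 1.10+; the values are placeholders, not a recommendation):

```yaml
# Share of Flink memory reserved as off-heap managed memory
# (used by RocksDB when that backend is enabled)
taskmanager.memory.managed.fraction: 0.4
state.backend.rocksdb.memory.managed: true
# Extra headroom so the YARN container is not killed for
# exceeding its memory limit
taskmanager.memory.jvm-overhead.max: 2g
```

Note that with the heap-based FsStateBackend all state lives on the JVM heap, which is consistent with the GC Overhead Limit Exceeded errors described; moving the state off-heap into RocksDB is the usual way to trade GC pressure for disk I/O.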
