I've got a Flink app with a Kinesis source. I'm seeing a lot of ReadProvisionedThroughputExceeded errors from AWS when running the Flink app. I've tried updating the consumer config with different settings to reduce the number of get record calls and increase time between calls but that doesn't seem to help:
consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_MAX, "500")
consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_INTERVAL_MILLIS, "30000")
consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_BACKOFF_BASE, "3000")
consumerConfig.put(ConsumerConfigConstants.SHARD_GETRECORDS_BACKOFF_MAX, "10000")
Are there other settings that I should be tuning? Thanks!
Try the following checks:
Check the Kinesis monitoring to see which limit is being exceeded on each poll: the number of records or the total bytes across all records.
Competing consumers. Are there any other consumers reading from the same shards? If enhanced fan-out is not enabled, they all share the same shard throughput.
Enabling enhanced fan-out (https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/datastream/kinesis/#using-enhanced-fan-out) is a potential solution.
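As a rough sketch, enabling enhanced fan-out on the Flink Kinesis consumer only takes a couple of extra config entries; the consumer name "my-flink-efo-consumer" below is just an illustrative value:

consumerConfig.put(ConsumerConfigConstants.RECORD_PUBLISHER_TYPE,
        ConsumerConfigConstants.RecordPublisherType.EFO.name());
consumerConfig.put(ConsumerConfigConstants.EFO_CONSUMER_NAME, "my-flink-efo-consumer");

With EFO each consumer gets its own dedicated read throughput per shard instead of sharing the 2 MB/s polling limit with other consumers.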
Turns out it's a bug in the version I'm using (Flink 1.12): https://issues.apache.org/jira/browse/FLINK-21661
Related
I'm reading data from a Kafka topic which has lots of data. Once Flink starts reading, it starts up fine, then crashes after some time when backpressure hits 100%, and then goes into an endless cycle of restarts.
My question is: shouldn't Flink's backpressure mechanism come into play and reduce consumption from the topic until the in-flight data is consumed or backpressure drops, as stated in this article: https://www.ververica.com/blog/how-flink-handles-backpressure? Or do I need to set some config which I'm missing? Is there any other solution to avoid this restart cycle when backpressure increases?
I've tried these configs:
taskmanager.network.memory.buffer-debloat.enabled: true
taskmanager.network.memory.buffer-debloat.samples: 5
My modules.yaml has this config for the transport:
spec:
  functions: function_name
  urlPathTemplate: http://nginx:8888/function_name
  maxNumBatchRequests: 50
  transport:
    timeouts:
      call: 2 min
      connect: 2 min
      read: 2 min
      write: 2 min
You should look in the logs to determine what exactly is causing the crash and restart, but typically when backpressure is involved in a restart it's because a checkpoint timed out. You could increase the checkpoint timeout.
However, it's better if you can reduce/eliminate the backpressure.
One common cause of backpressure is not providing Flink with enough resources to keep up with the load. If this is happening regularly, you should consider scaling up the parallelism. Or it may be that the egress is under-provisioned.
I see you've already tried buffer debloating (which should help). You can also try enabling unaligned checkpoints.
See https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/checkpointing_under_backpressure/ for more information.
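As a minimal sketch of those two knobs in the job setup (the interval and timeout values here are illustrative, not recommendations):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);                                 // checkpoint every 60 s (illustrative)
env.getCheckpointConfig().setCheckpointTimeout(30 * 60 * 1000);  // raise the timeout to 30 min (illustrative)
env.getCheckpointConfig().enableUnalignedCheckpoints();          // let barriers overtake in-flight data under backpressure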
I am following this Flink tutorial for reactive scaling and am interested in knowing how overall end-to-end latencies are affected by such rapid changes in the number of worker nodes. As per the documentation, I have added metrics.latency.interval: 1000 to the config map with the understanding that a new latency metric will be added, with markers being sent every 1 second. However, I cannot seem to find the corresponding histogram in Prometheus. The latency-related metrics that are available are listed below:
I am using Flink 1.14. Is there something which I am missing?
I suspect that something happened to the latency metric between releases 1.13.2 and 1.14. As of now, I am not able to see the latency metrics from Flink after migrating to 1.14, despite setting the latency interval to a positive number. Have you tried 1.13.2?
.. further exploration led me to believe it is the migration to the KafkaSource / KafkaSink classes, as opposed to the deprecated FlinkKafkaConsumer and FlinkKafkaProducer, that actually made the latency metric disappear. Currently, I am seeing the latency measures on Flink 1.14, but only when using the deprecated Kafka sources/sinks.
I am trying to build a Flink job that reads data from a Kafka source, does a bunch of processing including a few REST calls, and then finally sinks into another Kafka topic.
The problem I am trying to address is message retries. What if there are transient errors in the REST API? How can I do exponential-backoff-based retries of these messages, the way Storm supports?
I have 2 approaches that I could think of:
Use TimerService, but then in case of failures the state will start to grow uncontrollably.
Write failed messages to a different Kafka topic and process them with a delay of sorts, but here a problem can arise if the sink itself is down for a few minutes.
Is there a better more robust and simpler way to achieve this?
I would use Flink's AsyncFunction to make the REST calls. If needed, it will backpressure the source(s) rather than use more than a configured amount of state. For retries, see AsyncFunction retries.
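A minimal sketch of that approach (the callRest method is a placeholder for whatever HTTP client you use; the timeout and capacity values are illustrative):

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class RestEnrichment extends RichAsyncFunction<String, String> {
    @Override
    public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
        // issue the REST call off the main task thread and complete the future with the result
        CompletableFuture
            .supplyAsync(() -> callRest(input))
            .thenAccept(result -> resultFuture.complete(Collections.singleton(result)));
    }

    private String callRest(String input) {
        // placeholder for the actual HTTP request / response handling
        return input;
    }
}

// Wiring it into the pipeline: at most 100 in-flight requests, 30 s timeout (illustrative values)
// DataStream<String> enriched =
//     AsyncDataStream.unorderedWait(source, new RestEnrichment(), 30, TimeUnit.SECONDS, 100);

If you're on a newer Flink release (1.16+), there is also AsyncDataStream.unorderedWaitWithRetry, which takes a retry strategy (fixed delay or exponential backoff), so you don't have to hand-roll the retry loop inside asyncInvoke.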
I have a question about the Flink Kafka source:
A Flink application starts up after being restored from a checkpoint and runs well.
While it is running, several partitions are added to the Kafka topic. Will the running Flink application become aware of these added partitions and read from them without manual effort, or do I have to restart the application so that Flink picks up the partitions during startup?
Could you please point me to the code where Flink handles Kafka partition changes, if adding partitions doesn't need manual effort? I didn't find the logic in the code.
Thanks!
It looks like Flink will become aware of new topics and new partitions at runtime. The method call sequence is:
FlinkKafkaConsumerBase#run
FlinkKafkaConsumerBase#runWithPartitionDiscovery
FlinkKafkaConsumerBase#createAndStartDiscoveryLoop
In the last method, it kicks off a new thread that discovers new topics/partitions periodically.
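Note that with the legacy FlinkKafkaConsumer this discovery loop only runs when partition discovery is enabled via the consumer properties. A minimal sketch (topic name, bootstrap servers, and interval are illustrative):

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "kafka:9092");
props.setProperty("group.id", "my-group");
// enable periodic topic/partition discovery; check every 60 seconds (illustrative)
props.setProperty(FlinkKafkaConsumerBase.KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS, "60000");

FlinkKafkaConsumer<String> consumer =
    new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props);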
I am working on building an application with the requirements below, and I am just getting started with Flink.
Ingest data into Kafka with say 50 partitions (Incoming rate - 100,000 msgs/sec)
Read data from Kafka and process each record in real time (do some computation, compare with old data, etc.)
Store the output in Cassandra
I was looking for a real time streaming platform and found Flink to be a great fit for both real time and batch.
Do you think Flink is the best fit for my use case, or should I use Storm, Spark Streaming, or another streaming platform?
Do I need to write a data pipeline in Google Dataflow to execute my sequence of steps on Flink, or is there another way to perform a sequence of steps for real-time streaming?
Say each of my computations takes about 20 milliseconds; how can I better design it with Flink and get better throughput?
Can I use Redis or Cassandra to look up some data within Flink for each computation?
Will I be able to use a JVM in-memory cache inside Flink?
Also, can I aggregate data by key over some time window (for example, 5 seconds)? For example, let's say 100 messages are coming in and 10 messages have the same key; can I group all messages with the same key together and process them?
Are there any tutorials on best practices using flink?
Thanks and appreciate all your help.
Given your task description, Apache Flink looks like a good fit for your use case.
In general, Flink provides low latency and high throughput and has parameters to tune the trade-off between them. You can read and write data from and to Redis or Cassandra. However, you can also store state internally in Flink. Flink also has sophisticated support for windows. You can read the blog on the Flink website, check out the documentation for more information, or follow this Flink training to learn the API.
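To the keyed-window question specifically, a minimal sketch of grouping messages by key over a 5-second window (the Event type, key extractor, and merge logic are placeholders for your own types):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream<Event> events = ...; // e.g. the stream coming from the Kafka source
DataStream<Event> aggregated = events
    .keyBy(event -> event.getKey())                            // placeholder key extractor
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5))) // 5-second tumbling window
    .reduce((a, b) -> a.merge(b));                             // placeholder aggregation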