Increasing Parallelism in Flink decreases/splits the overall throughput - apache-flink

My problem is very similar to this one, except that backpressure in my application is reported as "OK".
I thought the problem was that my local machine did not have enough resources, so I created a 72-core Windows machine, where I read data from Kafka, process it in Flink, and then write the output back to Kafka. I have verified that writing to the Kafka sink is not causing any issues.
What I am looking for are the areas that may be causing the throughput to be split among task slots when I increase the parallelism.
Flink Version: 1.7.2
Scala version: 2.12.8
Kafka version: 2.11-2.2.1
Java version: 1.8.231
How the application works: data comes from Kafka (1 partition) and is deserialized by Flink (throughput here is 5k/sec). The deserialized message is then passed through basic schema validation (throughput here is 2k/sec).
Even after increasing the parallelism to 2, the throughput at level 1 (the deserializing stage) stays the same and does not double as expected.
I understand that it is difficult to debug without the code, so I am only asking for pointers that I can take back to my code and try.

We are using 1 Kafka partition for our input topic.
If you want to process data in parallel, you actually need to read data in parallel.
There are certain requirements for reading data in parallel. The most important one is that the source is able to split the data into smaller work chunks. For example, if you read from a file system, you have multiple files, or the system subdivides the files into splits. For Kafka, this necessarily means that you have to have more partitions. Ideally, you have at least as many partitions as your maximum consumer parallelism.
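As a rough sketch (the topic name, partition count, and properties below are placeholders, not taken from the question), the effective source parallelism is capped by the partition count:
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "my-consumer-group");

// With 8 partitions on the "events" topic, up to 8 source subtasks can read in parallel;
// any subtasks beyond the partition count would simply sit idle.
DataStream<String> stream = env
        .addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props))
        .setParallelism(8);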
The 5k/s seems to be the maximum throughput that you can achieve on one partition. You can also calculate the number of partitions from the maximum throughput you want to achieve: if you need to achieve 50k/s, you need at least 10 partitions. You should use more so that you can also catch up in case of reprocessing or failure recovery.
Another way to distribute the work is to add a manual shuffle step. That means, if you keep the single input partition, you would still only reach 5k/s, but after that the work is actually redistributed and processed in parallel, such that you will not see a huge decline in your throughput afterwards. After a shuffle operation, work is somewhat evenly distributed among the parallel downstream tasks.
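For example, a minimal sketch of adding a shuffle right after the single-partition source (kafkaConsumer and SchemaValidationMapper are made-up names standing in for your source and validation logic):
// The source is limited to parallelism 1 by the single Kafka partition, but
// rebalance() redistributes the records round-robin so that the validation
// step after it can run with parallelism 8.
env.addSource(kafkaConsumer).setParallelism(1)
        .rebalance()
        .map(new SchemaValidationMapper())
        .setParallelism(8);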

Related

Share data between task slots in Flink JVM memory

I have 5 different jobs running in 5 task slots. They all read from Kafka and sink back to Kafka. Kafka load is about 200K messages/sec.
I have another job, let's say job6, which needs to get some information from these 5 jobs. For each device we do some calculations in those 5 jobs, and according to the results of these calculations, in the 6th job I need to do something more.
As a first solution, I used side outputs in these 5 jobs and sent this additional info to a Kafka topic, which my 6th job then subscribed to. But as the workload on Kafka was already very high, this solution doubled the workload on Kafka.
As all task slots run in the same task manager JVM, what I have in mind is developing custom RichSink and RichSource functions which use the same static/singleton Java object. As it will be static, I believe all tasks will have access to the same object. This object will keep a queue (a Java BlockingQueue). Instead of feeding the data to Kafka, I will feed this queue in all tasks, and the 6th job will process the data received from this queue.
Please let me know if this is a good idea for a big distributed system. I assume clusters will not be a problem, because after reading data from the shared queue I will call keyBy(), so I hope Flink will handle that part. Also, please let me know about dangerous points and any tips you have.
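For concreteness, here is a minimal sketch of what I have in mind (the class names and the Result type are placeholders, and this is only the pattern I am asking about, not something I have validated):
// Shared in-JVM buffer; only visible to tasks running in the same task manager JVM.
public class SharedQueue {
    public static final BlockingQueue<Result> QUEUE = new LinkedBlockingQueue<>(100_000);
}

// In jobs 1-5: a sink that writes into the static queue instead of Kafka.
public class QueueSink extends RichSinkFunction<Result> {
    @Override
    public void invoke(Result value, Context context) throws Exception {
        SharedQueue.QUEUE.put(value);
    }
}

// In job 6: a source that drains the static queue.
public class QueueSource extends RichSourceFunction<Result> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Result> ctx) throws Exception {
        while (running) {
            Result next = SharedQueue.QUEUE.take();
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(next);
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}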
You essentially have an in-memory data store for bridging between two jobs. One of several issues here is that if the Task Manager crashes, you lose this data, thus eliminating one of the key benefits of Flink (guaranteed at-least-once or exactly-once processing).
You'd also have to ensure that you've got at least one of your job 6 source operators running in a slot on every TM instance. Flink doesn't yet support the ability to easily control which sub-tasks run in what slots, though if you set the downstream job's parallelism == the number of slots then you can work around that issue.
I'm sure there are other issues, I just haven't spent much time thinking about it :)
Depending on the version of Flink you're using, I wonder if Flink's new Table Store would be an option for you.
The GlobalAggregateManager in Flink may be helpful.
This can be used to share the state amongst parallel tasks in a job. However, performance may be poor in high-throughput scenarios.
Here are some demos of these projects:
Arctic, Flink

Flink checkpoints causes backpressure

I have a Flink job processing data at around 200k qps. Without checkpoints, the job is running fine.
But when I tried to add checkpoints (with a 50-minute interval), it causes backpressure at the first task, which adds a key field to each entry, and the consumer lag also grows constantly.
(Chart: consumer lag of my two Kafka topics. In the first half, with checkpoints enabled, the lag grows very quickly; in the second half, with checkpoints disabled, the lag stays within milliseconds.)
I am using at-least-once checkpoint mode, which should be an asynchronous process. Could anyone suggest what to look into?
My checkpointing settings:
env.enableCheckpointing(1800000, CheckpointingMode.AT_LEAST_ONCE); // 30-minute interval
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.getCheckpointConfig().setCheckpointTimeout(600000); // 10 minutes
env.getCheckpointConfig().setFailOnCheckpointingErrors(
        jobConfiguration.getCheckpointConfig().getFailOnCheckpointingErrors());
my job has 128 containers.
With a 10-minute checkpoint interval, the stats were as follows: (screenshot of checkpoint stats omitted)
I am now trying a 30-minute checkpoint interval to see whether it helps.
I also tried tuning the memory usage, but it does not seem to help.
But in the task manager it still looks like this: (screenshot omitted)
TL;DR: it's sometimes hard to analyse the problem. I have two lucky guesses/shots: if you are using the RocksDB state backend, you could switch to FsStateBackend - it's usually faster, and RocksDB makes the most sense with large state sizes that do not fit into memory (or if you really need the incremental checkpointing feature). The second is to fiddle with the parallelism, either increasing or decreasing it.
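If you want to try the first guess, switching the backend is a one-liner (the checkpoint path below is just a placeholder):
// Keep operator state on the TaskManager heap and write checkpoints to a file system path.
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));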
I would suspect the same thing that #ArvidHeise wrote. Your checkpoint size is not huge, but it's also not trivial. It can add enough extra overhead to push the job over the threshold from barely keeping up with the traffic to not keeping up and causing backpressure. If you are under backpressure, latency will just keep accumulating, so even a change of a couple of percent of extra overhead can make the difference between end-to-end latencies of milliseconds and an unbounded, ever-growing value.
If you cannot simply add more resources, you will have to analyse what exactly is adding this extra overhead and which resource is the bottleneck.
Is it CPU? Check CPU usage on the cluster. If it's ~100%, that's the thing you need to optimise for.
Is it IO? Check IO usage on the cluster and compare it against the maximal throughput/number of requests per second that you can achieve.
If both CPU & IO usage is low, you might want to try to increase parallelism, but...
Keep in mind data skew. Backpressure could be caused by a single task, and in that case the problem is hard to analyse, as the bottleneck will be a single thread (on either IO or CPU), not the whole machine.
After figuring out what resource is the bottleneck, next question would be why? It might be immediately obvious once you see it, or it might require digging in, like checking GC logs, attaching profiler etc.
Answering those questions could either give you information what you could try to optimise in your job or allow you to tweak configuration or could give us (Flink developers) an extra data point what we could try to optimise on the Flink side.
Any kind of checkpointing adds computation overhead. Most of the checkpointing happens asynchronously (as you have stated), but it still adds general I/O operations. These additional I/O requests may, for example, congest your access to external systems. Also, if you enable checkpointing, Flink needs to keep track of more information (new vs. already checkpointed).
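One thing you could try (a sketch, reusing the env from your snippet) is to enforce a minimum pause between checkpoints, so that checkpoint I/O never runs back-to-back with the previous checkpoint's I/O:
// Guarantee at least 10 minutes of checkpoint-free processing between two checkpoints.
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(600000);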
Have you tried to add more resources to your job? Could you share your whole checkpointing configuration?

Flink buffer request blocking leads to low performance

On our real-time streaming data platform based on flink-1.1.5, I have run into an issue.
Description of the phenomenon:
The business logic of our Flink job is a generic ETL process: the source and sink operators are both based on Kafka, and the transform operators implement the related ETL logic. However, the records read from the Kafka topic by the source operator vary in size: a record is less than 100 KB in off-peak hours but nearly 600 KB in peak hours. During off-peak periods the performance of the Flink job is fine; during peak periods, however, it is unsatisfactory: performance drops sharply and stays low until the peak period passes, after which it recovers. I did not see any full GC happen on the taskmanagers, and the system load of every taskmanager actually decreased during that period. Besides, there are no exceptions or other errors in the logs. Via JMX I can see that all threads are blocked requesting buffers at that time.
Related configuration and metrics:
flink standalone cluster on 1.1.5, which has 1 jobmanager and 5 taskmanagers (each one has 18 slots, which is fewer than its CPU cores)
jdk version is 1.8, with G1 as the GC type
source and sink are both kafka
job consistency type: exactly once
because the number of partitions of the source Kafka topic is lower than the number of transform operators, the parallelism of the operators is not all the same, so the tasks cannot be merged into one task chain as an optimization.
Based on what I described above, we have tried the following approaches:
1. modify the memory segment size (32 KB and 64 KB)
2. increase the network buffer pool capacity
3. modify Kafka consumer API parameters
4. enlarge the Flink cluster by adding more taskmanager nodes and more processing threads
5. decrease the number of transform operator threads and keep the parallelism of the source, transform, and sink operators the same, so that the tasks can become one task chain
We tried approaches #1, #2, #3, and #4, but they did not work; only #5 was effective. However, #5 is not a great option for us because its performance was still declining.
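For clarity, approach #5 boils down to giving the source, transform, and sink operators the same parallelism so that Flink can chain them into a single task (the operator names below are placeholders):
// With equal parallelism (and no repartitioning in between), the operators are chained
// into one task and records are handed over without going through network buffers.
env.addSource(kafkaSource).setParallelism(parallelism)
        .map(new EtlTransform()).setParallelism(parallelism)
        .addSink(kafkaSink).setParallelism(parallelism);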
Have you met this issue before? Is there any better approach to avoid this issue?

How to parallel write to sinks in Apache Flink

I have a map DataStream with a parallelism of 8. I add two sinks to the DataStream; one is slow (Elasticsearch), the other one is fast (HDFS). However, my events are only written to HDFS after they have been flushed to ES, so it takes an order of magnitude longer with ES than it does without ES.
dataStream.setParallelism(8);
dataStream.addSink(elasticsearchSink);
dataStream.addSink(hdfsSink);
It appears to me that both sinks use the same thread. Is it possible to write in parallel using the same source with two sinks, or do I have to add another job, one for each sink, to write the output in parallel?
I checked in the logs that Map(1/8) to Map(8/8) are getting deployed and receive data.
If the Elasticsearch sink cannot keep up with the speed at which its input is produced, it slows down its input operator(s). This concept is called backpressure, which means that a slow consumer blocks a fast producer from processing.
The only way to make your program behave as you expect (the HDFS sink writing faster than the Elasticsearch sink) is to buffer all records that the HDFS sink has written but the Elasticsearch sink hasn't written yet. If the Elasticsearch sink is consistently slower, you will run out of memory / disk space at some point.
Flink's approach to solve issues with slow consumers is backpressure.
I see two ways to fix this issue:
increase the parallelism of the ElasticsearchSink. This might or might not help, depending on the capabilities of your Elasticsearch setup.
run both jobs as independent pipelines. In this case you'll have to compute all results twice.
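For the first option, the sink parallelism can be set independently of the upstream map (a sketch reusing the sinks from the question; 16 is an arbitrary number):
dataStream.addSink(elasticsearchSink).setParallelism(16); // more parallel ES writers
dataStream.addSink(hdfsSink).setParallelism(8);           // unchanged
Note that this only helps if the Elasticsearch cluster itself can absorb the additional parallel writes; otherwise the backpressure described above will still propagate upstream.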

Flink batch: data local planning on HDFS?

we've been playing a bit with Flink. So far we've been using Spark and standard M/R on Hadoop 2.x / YARN.
Apart from the Flink execution model on YARN, which AFAIK is not dynamic like Spark's, where executors dynamically take and release virtual cores in YARN, the main point of the question is as follows.
Flink seems just amazing: for streaming API's, I'd only say that it's brilliant and over the top.
Batch APIs: the processing graphs are very powerful, and they are optimized and run in parallel in a unique way, leveraging cluster scalability much more than Spark and others, perfectly optimizing very complex DAGs that share common processing steps.
The only drawback I found, which I hope is just my misunderstanding and lack of knowledge, is that it doesn't seem to prefer data-local processing when planning batch jobs that read their input from HDFS.
Unfortunately that is not a minor one, because in 90% of use cases you have big, partitioned data storage on HDFS and usually you do something like:
read and filter (e.g. take only failures or successes)
aggregate, reduce, work with it
The first part, when done in plain M/R or Spark, is always planned with the 'prefer local processing' idiom, so that data is processed by the same node that keeps the data blocks, to be faster and to avoid data transfer over the network.
In our tests with a cluster of 3 nodes, set up specifically to test this feature and behaviour, Flink seemed to cope perfectly with HDFS blocks; e.g., if a file was made up of 3 blocks, Flink handled the 3 input splits and scheduled them in parallel.
But w/o the data-locality pattern.
Please share your opinion, I hope I just missed something or maybe it's already coming in a new version.
Thanks in advance to anyone taking the time to answer this.
Flink uses a different approach for local input split processing than Hadoop and Spark. Hadoop creates a Map task for each input split, which is preferably scheduled on a node that hosts the data referred to by the split.
In contrast, Flink uses a fixed number of data source tasks, i.e., the number of data source tasks depends on the configured parallelism of the operator and not on the number of input splits. These data source tasks are started on some node in the cluster and start requesting input splits from the master (JobManager). In the case of input splits for files in HDFS, the JobManager assigns the input splits with locality preference. So there is locality-aware reading from HDFS. However, if the number of parallel tasks is much lower than the number of HDFS nodes, many splits will be read remotely, because source tasks remain on the node on which they were started and fetch one split after the other (local ones first, remote ones later). Race conditions may also happen if your splits are very small, as the first data source task might rapidly request and process all splits before the other source tasks make their first request.
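As a rough sketch (assuming a batch ExecutionEnvironment env; the path and numberOfHdfsNodes are placeholders), raising the parallelism of the data source gives the locality-aware split assignment more local candidates to choose from:
// DataSet (batch) API: more source subtasks means more nodes can be offered
// their local splits before remote assignments become necessary.
DataSet<String> lines = env.readTextFile("hdfs:///data/input")
        .setParallelism(numberOfHdfsNodes);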
IIRC, the number of local and remote input split assignments is written to the JobManager logfile and might also be displayed in the web dashboard. That might help to debug the issue further. In case you identify a problem that does not seem to match with what I explained above, it would be great if you could get in touch with the Flink community via the user mailing list to figure out what the problem is.
