How does the Flink Optimizer decide on parallelism? - apache-flink

Below is a slide about Flink's optimizer from a presentation I watched. I'm particularly confused by the comment that Flink's optimizer decides on parallelism depending on the cardinalities of the provided dataset.
I'm currently going through the Flink 1.4 documentation (the version I'm using) and I can't find anything about how Flink decides on parallelism. Do I need to provide Flink's optimizer with statistics about the datasets in order to take advantage of this feature?
On a related note, I thought that specifying a maxParallelism value would potentially enable Flink to dynamically determine an appropriate level of parallelism for the provided dataset (as described above). However, I'm unable to set the max parallelism as described in the Flink 1.4 documentation, which is why I haven't been able to verify this hypothesis. For some context, I am using the DataSet API. How do I specify max parallelism in Flink?
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setMaxParallelism(20); // can't seem to call this method on env

Not sure where you found this presentation but it is quite old, probably 2014 or early 2015.
The slide discusses the optimizer of Flink's DataSet API. The optimizer is not used to optimize DataStream API programs. On the other hand, the setting of the maximum parallelism is only applicable for DataStream API programs but not for DataSet programs.
The quoted sentence is under the bullet point "Goal: efficient execution plans for data processing plans". Not all of its subpoints have been implemented, including automatic configuration of the execution parallelism.
The roadmap of the Flink community includes the plan to integrate the DataSet API into the DataStream API and drop the optimizer. Flink's Table API / SQL will continue to have a cost-based optimizer (based on Apache Calcite) and might also configure the execution parallelism in the future.
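For completeness, a minimal sketch of where the two settings live (API as of Flink 1.4, parallelism values arbitrary): the DataSet API's ExecutionEnvironment only exposes a default parallelism, which is why the call in the question does not compile, while max parallelism is a DataStream API setting on StreamExecutionEnvironment.

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// DataSet API: a default parallelism can be set, but there is no setMaxParallelism method.
ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
batchEnv.setParallelism(8);

// DataStream API: setMaxParallelism exists here and bounds later rescaling of keyed state.
StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
streamEnv.setMaxParallelism(20);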

Related

Extensive benchmark for Flink in stream processing

I am using Flink at my company and I am considering applying several scenarios to see the performance of each case.
Below are the scenarios that I will work on:
Experiments
End-to-end
Exactly-once or at-least-once
source: Kafka
sink: MySQL and Redis
logic: simple counting logic
For exactly-once, I will use the TwoPhaseCommitSink to achieve it.
Before doing the experiments, I am wondering about some issues, as below.
The performance of the sink
As you can see, I will use MySQL (an RDB) for the sink. Are there any descriptive benchmark results for using an RDB with at-least-once or exactly-once? I think that when the sink uses a database, the throughput will be affected because it takes some time to connect and communicate with the database. But I cannot find any documents or technical blogs showing detailed benchmark results for Flink when using a sink backed by an RDB.
In particular, I am also wondering whether exactly-once will have more degraded performance than at-least-once, making it hard to use for commercial purposes because of its slow processing.
So my questions are as below.
Are there any informative results for the two semantic modes (at-least-once, exactly-once) using a database sink (MySQL or Redis)?
Will end-to-end exactly-once semantics be very slow when using the MySQL sink? I will apply the TwoPhaseCommitSink.
Thanks.
A few reactions:
Simple, generic Flink benchmarks are pretty useless as predictors of specific application performance. So much depends on what a specific job is doing, and there's a lot of room for optimization.
Exactly-once with two-phase commit sinks is costly in terms of latency, but not so bad with respect to throughput. The issue is that the commit has to be done in concert with a checkpoint. If you need to checkpoint frequently in order to reduce latency, then that will more significantly harm the throughput (see the configuration sketch below).
Unaligned checkpoints and the changelog state backend can make a big difference for some use cases. If you want to include these in your testing, be sure to use Flink 1.16, which saw significant improvements in these areas.
The Flink project has invested quite a bit in having a suite of benchmarks that run on every commit. See https://github.com/apache/flink-benchmarks and http://codespeed.dak8s.net:8000/ for more info.
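For setting up the comparison, here is a minimal sketch of how the checkpointing mode and interval are typically configured in the DataStream API. The interval is an arbitrary example value, and the unaligned-checkpoint and changelog options require newer Flink versions, as noted in the comments.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// A two-phase-commit sink can only commit on checkpoints, so this interval
// directly trades end-to-end latency against checkpointing overhead.
env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

// Optional, Flink 1.11+: unaligned checkpoints help under backpressure.
env.getCheckpointConfig().enableUnalignedCheckpoints();

// Optional, Flink 1.15+: the changelog state backend can shorten checkpoint times.
env.enableChangelogStateBackend(true);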

Flink - take advantage of input partitioning to avoid inter task-manager communications

We have a Flink pipeline aggregating data per "client" by combining data with identical keys ("client-id") and within the same window.
The problem is trivially parallelizable and the input Kafka topic has a few partitions (same number as the Flink parallelism) - each holding a subset of the clients. I.e., a single client is always in a specific Kafka partition.
Does Flink take advantage of this automatically, or will it reshuffle the keys? And if the latter is true, can we somehow avoid the reshuffle and keep the data local to each operator as assigned by the input partitioning?
Note: we are actually using Apache Beam with the Flink backend but I tried to simplify the question as much as possible. Beam is using FixedWindows followed by Combine.perKey
I'm not familiar with the internals of the Flink runner for Beam, but assuming it is using a Flink keyBy, then this will involve a network shuffle. This could be avoided, but only rather painfully by reimplementing the job to use low-level Flink primitives, rather than keyed windows and keyed state.
Flink does offer reinterpretAsKeyedStream, which can be used to avoid unnecessary shuffles, but it can only be applied in situations where the existing partitioning exactly matches what keyBy would do -- and I see no reason to think that would apply here.
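For reference, a minimal sketch of how reinterpretAsKeyedStream is used; the Event type, the source, and the key selector are hypothetical, and the guarantee that the input partitioning matches Flink's key-group assignment is entirely on the caller.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Hypothetical source whose partitions already match what keyBy(client-id) would produce.
DataStream<Event> events = env.addSource(new PrePartitionedClientSource());

// Reinterpret the existing partitioning as a keyed stream: no network shuffle happens,
// but windows and keyed state are only correct if the partitioning assumption really holds.
KeyedStream<Event, String> byClient =
        DataStreamUtils.reinterpretAsKeyedStream(events, event -> event.getClientId());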

Flink autoscaling and max parallelism

Quote from javadoc on StreamExecutionEnvironment.setMaxParallelism:
The maximum degree of parallelism specifies the upper limit for dynamic scaling.
Which dynamic scaling exactly is meant here? I couldn't find any empirical evidence of operator auto-scaling: no matter how many free slots you have, how big maxParallelism is, or how many logical partitions there are, the actual parallelism (according to the web UI) is always the one that was set through setParallelism.
Also, according to this accepted and never-challenged answer https://stackoverflow.com/a/43493109/2813148
there's no such thing as dynamic scaling in Flink.
So is there any? Or is the javadoc misleading (and if so, what does "dynamic" mean there)?
If there's none, are there any plans for this feature?
Flink (in version 1.5.0) does not support dynamic scaling yet.
However, a job can be scaled manually (or by an external service) by taking a savepoint, stopping the running job, and restarting it with an adjusted (smaller or larger) parallelism. The new parallelism can be at most the previously configured max parallelism: once a job has been started, the max parallelism is baked into its savepoints and cannot be changed anymore.
Support for dynamic scaling is on the roadmap. Since version 1.5.0 (released in May 2018), Flink supports dynamic resource allocation from resource managers such as Yarn and Mesos. This is an important step towards dynamic scaling. In fact, an experimental version of this feature has been demonstrated at Flink Forward SF 2018 in April 2018.
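To make the relationship concrete, a minimal sketch of the two settings on the DataStream API (the values are arbitrary):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// The parallelism the job actually runs with; Flink will not change it on its own.
env.setParallelism(4);

// Upper bound for any later savepoint-based rescaling; baked into the keyed state once the job runs.
env.setMaxParallelism(128);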

How can I access states computed from an external Flink job? (without knowing its id)

I'm new to Flink and I'm currently testing the framework for a use case consisting of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), then scoring the transaction with a machine learning model.
For now, the features are all kept in Flink state and the same job is scoring the enriched transaction. But I'd like to separate the feature computation job from the scoring job and I'm not sure how to do this.
Queryable state doesn't seem to fit for this, as the job id is needed, but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is splitting this task into two jobs a bad idea with Flink? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check if it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help!
Something like Kafka works well for this kind of decoupling. In that way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
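A rough sketch of that wiring with the Kafka connector; the topic name and serialization are made up, featureStream and scoringEnv stand in for the feature job's output stream and the scoring job's environment, and the connector class names and packages differ between Flink versions.

import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");

// Feature job: publish the computed features (serialized as strings here) to a topic.
featureStream.addSink(new FlinkKafkaProducer010<>("features", new SimpleStringSchema(), props));

// Scoring job: consume the same topic and feed the features into the model.
DataStream<String> features =
        scoringEnv.addSource(new FlinkKafkaConsumer010<>("features", new SimpleStringSchema(), props));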
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
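To illustrate the Async I/O approach, a rough sketch: the transaction types, the ScoringClient, and its scoreAsync call returning a CompletableFuture are hypothetical, and the API names follow more recent Flink releases than 1.3.

import java.util.Collections;
import java.util.concurrent.TimeUnit;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncScoring extends RichAsyncFunction<EnrichedTransaction, ScoredTransaction> {
    private transient ScoringClient client; // hypothetical non-blocking client for the model service

    @Override
    public void open(Configuration parameters) {
        client = new ScoringClient();
    }

    @Override
    public void asyncInvoke(EnrichedTransaction txn, ResultFuture<ScoredTransaction> resultFuture) {
        client.scoreAsync(txn).thenAccept(score ->
                resultFuture.complete(Collections.singleton(new ScoredTransaction(txn, score))));
    }
}

// Wiring: bound the number of in-flight requests and set a timeout for each call.
DataStream<ScoredTransaction> scored = AsyncDataStream.unorderedWait(
        enrichedTransactions, new AsyncScoring(), 5, TimeUnit.SECONDS, 100);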
And if you do want to use queryable state, you could use Flink's REST API to determine the job id.
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.

#Apache-flink: use case for Data Management

I am trying to build a data management (DM) solution involving high-volume data ingestion, passing the data through some domain rules, substituting (enriching) values, and flagging erroneous data before sending it off to a downstream system. The rule checking and value replacement can be something simple like permissible numeric thresholds that the data elements should satisfy, or something more complex like a lookup against a master-data pool of domain values.
Do you think that Apache Flink would be a good candidate for such processing? Can Flink operators be defined to do a lookup (against the master data) for each tuple flowing through them? I see some possible drawbacks of employing Apache Flink for the latter: 1) the lookup could be a blocking operation that would slow down the throughput, 2) checkpointing and persisting the operator state cannot be done if the operator functions have to fetch master data from elsewhere.
What are your thoughts? Is there some other tool better suited to this use case?
Thanks
The short answer is 'yes'. You can use Flink for all the things you mentioned, including data lookups and enrichment, with the caveat that you won't have at-most-once or exactly-once guarantees on side effects caused by your operators (like updating external state.) You can work around the added latency of external lookups with higher parallelism on that particular operator.
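A minimal sketch of what such a lookup/enrichment operator could look like in the DataStream API, with the extra parallelism applied only to that operator (MasterDataClient and the record types are hypothetical):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;

public class EnrichWithMasterData extends RichMapFunction<RawRecord, EnrichedRecord> {
    private transient MasterDataClient client; // hypothetical master-data lookup client

    @Override
    public void open(Configuration parameters) {
        client = new MasterDataClient(); // one connection per parallel subtask
    }

    @Override
    public EnrichedRecord map(RawRecord record) {
        // Blocking lookup against master data; its latency is hidden by
        // running this operator with more parallel instances than the rest of the job.
        return EnrichedRecord.from(record, client.lookup(record.getDomainKey()));
    }
}

// Usage: raise the parallelism of only the enrichment operator.
DataStream<EnrichedRecord> enriched = rawRecords.map(new EnrichWithMasterData()).setParallelism(32);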
It's impossible to give a precise answer without more information, such as what exactly constitutes 'high volume data' in your case, what your per-event latency requirements are, what other constraints you have, etc. However, in a general sense, before you commit to using Flink you should take a look at both Spark Streaming and Apache Storm and compare. Both Spark and Storm have larger communities and more documentation, so it might save you some pain in the long run. Tags on StackOverflow at the time of writing: spark-streaming x 1746, apache-storm x 1720, apache-flink x 421.
More importantly, Spark Streaming has similar semantics to Flink, but will likely give you better bulk data throughput. Alternatively, Storm is conceptually similar to Flink (spouts/bolts vs operators) and actually has lower performance/throughput in most cases, but is just a more established framework.
