Can I avoid network shuffle when creating a KeyedStream from a pre-partitioned Kinesis Data Stream in Apache Flink? - apache-flink

Is it possible to create a KeyedStream from a pre-sharded/pre-partitioned Kinesis Data Stream without the need for a network shuffle (i.e. using reinterpretAsKeyedStream or something similar)?
If that is not possible (i.e. the only reliable approach is to consume from Kinesis and then use keyBy), is network shuffling at least minimized by doing a keyBy on the field that the source is sharded by (e.g. env.addSource(source).keyBy(pojo -> pojo.getTransactionId()), where the source is a Kinesis Data Stream sharded by transactionId)?
If the above is possible, what are the limitations?
What I've Learned so Far
The functionality I am describing is already implemented by reinterpretAsKeyedStream, but this feature is experimental and seems to have significant drawbacks (as per the discussions in the Stack Overflow posts below)
reinterpretAsKeyedStream docs: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/experimental/
Using KeyBy vs reinterpretAsKeyedStream() when reading from Kafka
Apache Flink - how to align Flink and Kafka sharding
In addition to the above, all the discussions related to reinterpretAsKeyedStream that I've found are in the context of Kafka, so I'm not sure how the outcomes differ for a Kinesis Data Stream
Context of my Application
Re. configurations: both the Kinesis Data Stream and Flink will be hosted serverlessly, and will automatically scale up/down depending on load (which, as I understand it, means that reinterpretAsKeyedStream cannot be used)
Any help/insight is much appreciated, thanks!

I don't believe there's any way to easily do what you want, at least not in a way that's resilient to changes in the parallelism of your source and your cluster. I have used helicopter stunts to achieve something similar to this, but it involved fragile code whose correctness depended on exactly how Flink handles partitioning internally.
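For concreteness, here is a minimal sketch of what the experimental API looks like when applied to a Kinesis source. The Transaction POJO and the kinesisConsumer are hypothetical stand-ins, and nothing at runtime validates the keying assumption, so violations show up as wrong results rather than errors.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

// Assumes a hypothetical Transaction POJO and an already-configured
// FlinkKinesisConsumer<Transaction> named kinesisConsumer.
DataStream<Transaction> source = env.addSource(kinesisConsumer);

// Reinterpret the stream as if keyBy(transactionId) had already run,
// skipping the network shuffle. This is only safe when every record for
// a key arrives at the subtask that Flink's own key-group hashing would
// assign it to; Kinesis shard hashing gives no such guarantee, and
// rescaling either the stream or the job silently breaks the assumption.
KeyedStream<Transaction, String> keyed =
    DataStreamUtils.reinterpretAsKeyedStream(source, tx -> tx.getTransactionId());
```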

Related

Flink - take advantage of input partitioning to avoid inter task-manager communications

We have a Flink pipeline aggregating data per "client" by combining data with identical keys ("client-id") and within the same window.
The problem is trivially parallelizable and the input Kafka topic has a few partitions (same number as the Flink parallelism) - each holding a subset of the clients. I.e., a single client is always in a specific Kafka partition.
Does Flink take advantage of this automatically, or will it reshuffle the keys? And if the latter is true, can we somehow avoid the reshuffle and keep the data local to each operator as assigned by the input partitioning?
Note: we are actually using Apache Beam with the Flink backend but I tried to simplify the question as much as possible. Beam is using FixedWindows followed by Combine.perKey
I'm not familiar with the internals of the Flink runner for Beam, but assuming it uses a Flink keyBy, this will involve a network shuffle. This could be avoided, but only rather painfully, by reimplementing the job to use low-level Flink primitives rather than keyed windows and keyed state.
Flink does offer reinterpretAsKeyedStream, which can be used to avoid unnecessary shuffles, but it can only be applied in situations where the existing partitioning exactly matches what keyBy would do, and I see no reason to think that would apply here.
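To give a flavor of the painful low-level route, here is a heavily simplified sketch, assuming hypothetical Event and ClientTotal types. It aggregates per client inside each parallel subtask, relying on the Kafka partitioning to keep each client on one subtask; because the flatMap chains directly to the source at the same parallelism, no shuffle occurs, but the state is not fault-tolerant and the count-based flush is only a crude stand-in for real windows.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Aggregates per client within a single subtask. State is a plain HashMap:
// it is lost on failure and invalidated by any rescaling of source or job.
public class LocalClientAggregator extends RichFlatMapFunction<Event, ClientTotal> {
    private transient Map<String, Long> totals;
    private transient long seen;

    @Override
    public void open(Configuration parameters) {
        totals = new HashMap<>();
    }

    @Override
    public void flatMap(Event event, Collector<ClientTotal> out) {
        totals.merge(event.clientId, event.amount, Long::sum);
        // Crude count-based flush standing in for a real window trigger.
        if (++seen % 10_000 == 0) {
            totals.forEach((id, sum) -> out.collect(new ClientTotal(id, sum)));
            totals.clear();
        }
    }
}
```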

Understanding how Akka provides back pressure

So, we have a use case in our production systems where we could probably use Akka Streams. To understand exactly how Akka Streams provides back pressure, I would like to go a bit deeper into our requirements.
We have a Solr cluster that hosts some of our data. Next, we have a Play app that serves the front-end customer-facing site. Every incoming request ultimately boils down to fetching a good deal of data from Solr using the /sql handler that Solr provides. Once we fetch the entire dataset from Solr, we morph it and write it back to a Cassandra cluster. This can be converted into a problem that can be solved using Akka Streams, where the Solr stream from the /sql handler is the Akka Source, the Cassandra storage is the Sink, and everything in between is a set of custom Flows.
I was studying Akka Streams and understand it's an implementation of the Reactive Streams specification. Most notably, Akka Streams provides back pressure to make sure the consumer isn't overwhelmed by the producer. Now, with respect to my use case, I want to understand how Akka provides back pressure.
As I see it, there's a reactive streams library for Cassandra. Since Cassandra is the consumer in our case, this driver will be capable of signalling to the producer how much data it is able to receive. That would mean there has to be a corresponding driver on the producer side that can react to this signal and control the emission of elements. Specifically, since the producer in our case is Solr, wouldn't I also have to use a reactive-compliant Solr driver to fetch documents from Solr and stream them into my application? This driver would then be capable of controlling the rate at which it fetches documents from the Solr cluster whenever the Cassandra reactive driver signals it to backpressure. Is this correct?
If that is indeed the case, will using Akka Streams with a non-reactive driver on the producer side provide any benefits? Specifically, are there other ways that Akka publishers can provide back pressure in such cases, when the driver isn't reactive-compliant?
For Solr, there's also a fully reactive Akka Streams implementation from the Alpakka project, so using that as the Source would handle backpressure, though it would mean not using the SQL interface for expressing the query.
On the other hand, since the Solr SQL interface is essentially a JDBC facade over Solr, it's possible to use the Alpakka Slick integration, as long as you define an instance of slick.jdbc.JdbcProfile that uses the Solr JDBC driver.
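If neither Alpakka integration fits, a non-reactive client can still be adapted by hand: wrap the blocking fetch in a pull-based Source so that pages are only requested when downstream signals demand. A minimal sketch, where fetchPageFromSolr and writeToCassandra are hypothetical placeholders:

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import akka.Done;
import akka.actor.ActorSystem;
import akka.japi.Pair;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

public class SolrToCassandra {
    // Placeholder for a blocking, non-reactive call against Solr's /sql handler.
    static List<String> fetchPageFromSolr(int offset) {
        return Collections.emptyList();
    }

    // Placeholder for an asynchronous Cassandra write.
    static CompletionStage<Done> writeToCassandra(String doc) {
        return CompletableFuture.completedFuture(Done.getInstance());
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("solr-to-cassandra");

        // unfold only invokes the fetch when downstream demands an element,
        // so even a blocking client is pulled at the rate the stream absorbs.
        Source.<Integer, List<String>>unfold(0, offset -> {
                List<String> page = fetchPageFromSolr(offset);
                return page.isEmpty()
                    ? Optional.<Pair<Integer, List<String>>>empty()
                    : Optional.of(Pair.create(offset + page.size(), page));
            })
            .mapConcat(page -> page)
            // At most 8 writes in flight; when Cassandra slows down, demand
            // stops propagating upstream and fetching pauses automatically.
            .mapAsync(8, SolrToCassandra::writeToCassandra)
            // Akka 2.6 style: the ActorSystem provides the materializer.
            .runWith(Sink.foreach(done -> System.out.println("write acknowledged")), system);
    }
}
```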

Akka streams vs Apache Flink

While exploring Akka Streams, I also came across Apache Flink, which is a stream processing engine.
Akka Streams implements Reactive Streams and supports back pressure.
So if I have to make a decision between the two, which one should I go for? How do they differ, and what do they have in common? What should the criteria be here?
Akka Streams is a library implementing the Reactive Streams specification.
Apache Flink is a streaming engine.
The main high-level difference is that with Apache Flink you create a job by coding against one of Flink's APIs and you submit that job to an Apache Flink cluster; it is the cluster that executes your stream processing job. With Akka Streams you are creating a standalone application. In that sense, Akka Streams is the more lightweight of the two.
You can still distribute an Akka Streams based app by using StreamRefs, though you need to do that explicitly in the code and you need to run an Akka Cluster. Apache Flink already manages a cluster, so you don't need to do that explicitly in your code (though you still need the cluster set up and running to submit your jobs to). Apache Flink has smarts built in to take a job and execute it in an optimal way, parallelizing and distributing execution when possible. You don't get that with Akka Streams.
Apache Flink stream processing is designed to achieve end-to-end exactly-once processing semantics in the face of failures. In Akka Streams, such a guarantee would need to be implemented explicitly in your code.
Akka Streams, as a Reactive Streams implementation, is all about asynchronous and memory-bound processing. Akka HTTP, for example, is built on top of Akka Streams and as a result implements very efficient and lightweight client and server sides of the HTTP protocol.
Akka Streams implements asynchronous non-blocking backpressure (as per the Reactive Streams specification) to guarantee memory boundedness during execution. Apache Flink also has a backpressure mechanism, though it's not implemented in the same way.
Akka Streams, as an implementation of the Reactive Streams specification, can interoperate with other implementations like RxJava or Project Reactor. Apache Flink is not part of any broader standard.
I would say the main reasons to go for Apache Flink are the exactly-once guarantees and the automated distribution that comes with it. Otherwise, Akka Streams is a very powerful API with a simpler execution model.
EDIT:
It is probably worth mentioning the Alpakka project, which brings a lot of technologies to Akka Streams so that they can be plugged into reactive-streams-based processing.
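As a rough sketch of the execution-model difference described above (not a complete application): the Flink half builds a job graph that env.execute() hands to a Flink runtime to run, while the Akka Streams half materializes and runs the stream inside the current JVM.

```java
import java.util.Arrays;
import akka.actor.ActorSystem;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UppercaseComparison {
    public static void main(String[] args) throws Exception {
        // Flink: describe a job, then submit it to a (possibly local) runtime.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("a", "b", "c").map(String::toUpperCase).print();
        env.execute("uppercase-job");

        // Akka Streams: the stream runs right here, inside this application.
        ActorSystem system = ActorSystem.create("uppercase-app");
        Source.from(Arrays.asList("a", "b", "c"))
            .map(String::toUpperCase)
            .runWith(Sink.foreach(System.out::println), system);
    }
}
```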
I am not an expert in Akka Streams, but as far as I know, the main difference is that Flink offers the distribution of processing out of the box, while Akka Streams does not, since it was designed to process data on a single node.
The similarity between the two is that they both offer stream processing capabilities, and in this sense they probably have similar functionality.
But Flink has multiple additional modules, like SQL, CEP, or Machine Learning, that you won't be able to get in Akka Streams. Also, Flink provides fail-safety and state recovery, which I am not sure is present in Akka Streams out of the box.
On the other hand, setting up Akka Streams requires less work, as you don't need to care about setting up a JobManager and TaskManagers; you can simply create a Java/Scala application, dockerize it, and run it somewhere.
So, the main question you should ask yourself is whether the data you are processing is big enough that it will need to be processed on multiple nodes. If it is, then you really have no choice other than Flink (in the Akka Streams vs. Flink scenario). If, however, the data you are going to process can be processed on a single node, then you should assess the fail-safety and message delivery guarantees you need. In the general case, Akka Streams may be easier to start with, but Flink may take over when it comes to productionizing the app.

How can I access state computed from an external Flink job? (without knowing its id)

I'm new to Flink and I'm currently testing the framework for a use case that consists of enriching transactions coming from Kafka with a lot of historical features (e.g. the number of past transactions between the same source and the same target), then scoring the transactions with a machine learning model.
For now, the features are all kept in Flink state, and the same job scores the enriched transactions. But I'd like to separate the feature-computation job from the scoring job, and I'm not sure how to do this.
Queryable state doesn't seem to fit here, as the job id is needed, but tell me if I'm wrong!
I've thought about querying RocksDB directly, but maybe there's a simpler way?
Is separating this task into two jobs a bad idea with Flink? We do it this way for the same test with Kafka Streams, in order to avoid complex jobs (and to check whether it has any positive impact on latency).
Some extra information: I'm using Flink 1.3 (but I'm willing to upgrade if needed) and the code is written in Scala.
Thanks in advance for your help!
Something like Kafka works well for this kind of decoupling. That way you could have one job that computes the features and streams them out to a Kafka topic that is consumed by the job doing the scoring. (Aside: this would make it easy to do things like run several different models and compare their results.)
Another approach that is sometimes used is to call out to an external API to do the scoring. Async I/O could be helpful here. At least a couple of groups are using stream SQL to compute features, and wrapping external model scoring services as UDFs.
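To make the async I/O suggestion concrete, here is a hedged sketch assuming a recent Flink version with the ResultFuture-based async API; EnrichedTransaction, ScoredTransaction, and ScoringClient are hypothetical.

```java
import java.util.Collections;
import java.util.concurrent.TimeUnit;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Scores enriched transactions against an external model-serving endpoint
// without blocking the task thread while each response is in flight.
public class ScoringFunction extends RichAsyncFunction<EnrichedTransaction, ScoredTransaction> {
    private transient ScoringClient scoringClient;  // hypothetical non-blocking client

    @Override
    public void open(Configuration parameters) {
        scoringClient = new ScoringClient();  // hypothetical: connect to the scoring service
    }

    @Override
    public void asyncInvoke(EnrichedTransaction tx, ResultFuture<ScoredTransaction> result) {
        // score(...) is assumed to return a CompletableFuture<Double>.
        scoringClient.score(tx).thenAccept(score ->
            result.complete(Collections.singleton(new ScoredTransaction(tx, score))));
    }
}
```

Usage: at most 100 concurrent requests with a 1-second timeout each, applied to the stream produced by the feature job.

```java
DataStream<ScoredTransaction> scored = AsyncDataStream.unorderedWait(
    enrichedTransactions, new ScoringFunction(), 1000, TimeUnit.MILLISECONDS, 100);
```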
And if you do want to use queryable state, you could use Flink's REST API to determine the job id.
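For example, in recent Flink versions the REST endpoint GET /jobs lists the ids and statuses of submitted jobs, so a client could discover the job id along these lines (the jobmanager host/port is a placeholder):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Queries the JobManager's REST endpoint for submitted jobs; the response
// is JSON of the form {"jobs":[{"id":"<job-id>","status":"RUNNING"}, ...]}.
HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("http://jobmanager:8081/jobs"))  // placeholder host/port
    .build();
String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
System.out.println(body);  // parse out the id of the feature-computation job
```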
There have been several talks at Flink Forward conferences about using machine learning models with Flink. One example: Fast Data at ING – Building a Streaming Data Platform with Flink and Kafka.
There's an ongoing community effort to make all this easier. See FLIP-23 - Model Serving for details.

Using Apache Flink for data streaming

I am working on building an application with the requirements below, and I am just getting started with Flink.
Ingest data into Kafka with, say, 50 partitions (incoming rate: 100,000 msgs/sec)
Read data from Kafka and process each record (do some computation, compare with old data, etc.) in real time
Store the output in Cassandra
I was looking for a real-time streaming platform and found Flink to be a great fit for both real-time and batch processing.
Do you think Flink is the best fit for my use case, or should I use Storm, Spark Streaming, or another streaming platform?
Do I need to write a data pipeline in Google Dataflow to execute my sequence of steps on Flink, or is there another way to perform a sequence of steps for real-time streaming?
Say each computation takes around 20 milliseconds; how can I best design it with Flink to get better throughput?
Can I use Redis or Cassandra to fetch some data within Flink for each computation?
Will I be able to use a JVM in-memory cache inside Flink?
Also, can I aggregate data based on a key over some time window (for example, 5 seconds)? Let's say 100 messages come in and 10 of them share the same key; can I group all messages with the same key together and process them?
Are there any tutorials on best practices for using Flink?
Thanks, and I appreciate all your help.
Given your task description, Apache Flink looks like a good fit for your use case.
In general, Flink provides low latency and high throughput and has a parameter to tune these. You can read and write data from and to Redis or Cassandra. However, you can also store state internally in Flink. Flink also has sophisticated support for windows. You can read the blog on the Flink website, check out the documentation for more information, or follow this Flink training to learn the API.
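On the windowed-aggregation question specifically, grouping messages that share a key over a 5-second window looks roughly like the sketch below, where Message is a hypothetical POJO with getKey()/getValue() and kafkaSource is an already-configured Kafka consumer:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Groups messages that share a key and sums their values every 5 seconds.
DataStream<Message> messages = env.addSource(kafkaSource);  // hypothetical Kafka source
DataStream<Message> aggregated = messages
    .keyBy(msg -> msg.getKey())
    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
    .reduce((a, b) -> new Message(a.getKey(), a.getValue() + b.getValue()));
```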
