Does akka-streams support clustering? If yes, please share an example - akka-stream

I am using Akka Streams in my application to handle real-time data.
The data volume is so high that I want to scale my application horizontally.
Can someone help me understand whether akka-streams supports clustering? If yes, please share an example.

I have found no examples or documentation that would indicate akka-stream is capable of running directly in a clustered mode. However, there are a few workarounds that you may be able to deploy.
Integration with Akka Cluster
You could have a Flow.map or Flow.mapAsync send an incoming object to the cluster and wait for the response.
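For example, a minimal sketch of that approach, assuming `clusterProxy` is a cluster-aware router or ClusterSingleton proxy you have already set up (the `Work`/`Result` types are illustrative, not from any library):

```scala
import akka.NotUsed
import akka.actor.{ActorRef, ActorSystem}
import akka.pattern.ask
import akka.stream.scaladsl.Flow
import akka.util.Timeout

import scala.concurrent.duration._

// Illustrative message types, not from any library.
final case class Work(payload: String)
final case class Result(value: String)

implicit val system: ActorSystem = ActorSystem("stream-node")
implicit val timeout: Timeout = 5.seconds

// `clusterProxy` is assumed to be a cluster-aware router or a
// ClusterSingleton proxy that forwards Work messages to remote nodes.
def clusterFlow(clusterProxy: ActorRef): Flow[Work, Result, NotUsed] =
  Flow[Work].mapAsync(parallelism = 8) { work =>
    // ask a remote actor; mapAsync caps in-flight requests at 8,
    // so the stream's backpressure still applies to the remote work
    (clusterProxy ? work).mapTo[Result]
  }
```

Because `mapAsync` bounds the number of outstanding asks, backpressure is preserved even though the actual processing happens on other nodes.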
Sharding of Incoming Data
The source data could be broken up by a sharding function and sent to independent services which process the data in parallel. Each service could operate with a single akka-stream but the cluster of applications would allow for multiple streams running independently.
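As an illustration only, a stable hash routing function of the kind I mean, which keeps records with the same key on the same service instance:

```scala
// Illustrative only: a stable hash routing function so records with the
// same key always go to the same independent service instance.
final case class Record(key: String, payload: String)

def shardFor(key: String, numShards: Int): Int =
  Math.floorMod(key.hashCode, numShards) // floorMod avoids negative shard ids

// With e.g. 4 services, every Record("user-42", ...) routes to the same
// service, and each service runs its own single akka-stream internally.
```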

Related

Can I avoid network shuffle when creating a KeyedStream from a pre-partitioned Kinesis Data Stream in Apache Flink?

Is it possible to create a KeyedStream from a pre-sharded/pre-partitioned Kinesis Data Stream without the need for a network shuffle (i.e. using reinterpretAsKeyedStream or something similar)?
If that is not possible (i.e. the only reliable option is to consume from Kinesis and then use keyBy), then is network shuffling at least minimized by doing a keyBy on the field that the source is sharded by (e.g. env.addSource(source).keyBy(pojo -> pojo.getTransactionId()), where the source is a Kinesis data stream that is sharded by transactionId)?
If the above is possible, what are the limitations?
What I've Learned so Far
The functionality I am describing is already implemented by reinterpretAsKeyedStream, but this feature is experimental and seems to have significant drawbacks (as per discussions in the Stack Overflow posts below)
reinterpretAsKeyedStream docs: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/experimental/
Using KeyBy vs reinterpretAsKeyedStream() when reading from Kafka
Apache Flink - how to align Flink and Kafka sharding
In addition to the above, all the discussions related to reinterpretAsKeyedStream that I've found are in the context of Kafka, so I'm not sure how the outcomes differ for a Kinesis Data Stream.
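For reference, the kind of usage I'm asking about looks roughly like this (a sketch against the experimental DataStreamUtils API; the Kinesis source and the Txn POJO are stand-ins):

```scala
import org.apache.flink.api.java.functions.KeySelector
import org.apache.flink.streaming.api.datastream.{DataStream, DataStreamUtils, KeyedStream}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

// Hypothetical POJO matching the keyBy example above.
case class Txn(transactionId: String, amount: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Stand-in for the real FlinkKinesisConsumer reading the pre-sharded stream.
val txns: DataStream[Txn] = env.fromElements(Txn("a", 1.0), Txn("b", 2.0))

// Declare the stream as already partitioned by transactionId, skipping the
// shuffle. This is only safe if the upstream partitioning exactly matches
// Flink's key-group assignment, which is what makes the feature fragile.
val keyed: KeyedStream[Txn, String] =
  DataStreamUtils.reinterpretAsKeyedStream(
    txns,
    new KeySelector[Txn, String] {
      override def getKey(t: Txn): String = t.transactionId
    })
```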
Context of my Application
Re. configuration: both the Kinesis Data Stream and Flink will be hosted serverlessly, and will automatically scale up/down depending on load (which, as I understand it, means that reinterpretAsKeyedStream cannot be used)
Any help/insight is much appreciated, thanks!
I don't believe there's any way to easily do what you want, at least not in a way that's resilient to changes in the parallelism of your source and your cluster. I have used helicopter stunts to achieve something similar to this, but it involved fragile code (depends on exactly how Flink handles partitioning).

Understanding how Akka provides back pressure

So, we have a use case in our production systems where we could probably use Akka streams. To understand how Akka streams exactly provide back pressure, I would like to go a bit deeper into our requirements.
We have a Solr cluster that hosts some of our data. Next, we have a Play app that serves the front-end, customer-facing site. Every incoming request ultimately boils down to fetching a good deal of data from Solr using the /sql handler that Solr provides. Once we fetch the entire dataset from Solr, we morph it and write it back to a Cassandra cluster. This can be converted into a problem which can be solved using Akka streams, where the Solr stream from the /sql handler will be the akka Source, the Cassandra storage will be the Sink, and everything in between will be custom Flows.
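To make the shape of the pipeline concrete, here is a rough skeleton of what I have in mind (the SolrDoc/CassandraRow types and the source/sink bodies are stand-ins, not real drivers):

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

implicit val system: ActorSystem = ActorSystem("solr-to-cassandra")

// Hypothetical shapes; the real Source would wrap the Solr /sql handler
// and the real Sink would write to Cassandra.
final case class SolrDoc(id: String, body: String)
final case class CassandraRow(id: String, data: String)

val solrSource: Source[SolrDoc, NotUsed] =
  Source(List(SolrDoc("1", "hello"), SolrDoc("2", "world"))) // stand-in

val morph: Flow[SolrDoc, CassandraRow, NotUsed] =
  Flow[SolrDoc].map(d => CassandraRow(d.id, d.body.toUpperCase)) // custom transform

val cassandraSink: Sink[CassandraRow, _] =
  Sink.foreach[CassandraRow](println) // stand-in for the Cassandra write

solrSource.via(morph).runWith(cassandraSink)
```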
I was studying Akka streams and understand it's an implementation of Reactive Streams. Most notably, the way Akka streams provide back pressure to make sure the consumer isn't overwhelmed by the producer. Now, with respect to my use case, I want to understand how Akka provides back pressure.
As I can see it, there's a reactive streams library for Cassandra. Since it's the consumer in our case, this driver will be capable of signalling to the producer about how much data it will be able to receive. That would mean, there has to be a corresponding driver on the producer side that can react to this signal and control the emitting of elements. Specifically, since the producer in our case is Solr, isn't it correct that I would also have to use a reactive-compliant Solr driver that I can use to fetch documents from Solr and stream it in my application? This driver would then be capable of controlling the rate at which it has to fetch the documents from the Solr cluster whenever the Cassandra reactive driver signals it to backpressure. Isn't this correct?
If that is indeed the case, will using Akka streams with a non-reactive driver on the producer side provide any benefits? Specifically, are there other ways that Akka publishers can provide back pressure capabilities in such cases when the driver isn't reactive-compliant?
For Solr, there's also a fully reactive Akka Streams implementation from the Alpakka project, so using that as the Source would handle backpressure, though it would mean not using the SQL interface for expressing the query.
On the other hand, since the Solr SQL interface is essentially a JDBC facade over Solr, it's possible to use the Alpakka Slick integration as long as you define an instance of slick.jdbc.JdbcProfile which uses the Solr JDBC driver.
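A minimal sketch of that Slick route, assuming a "slick-solr" config block whose profile is such a custom JdbcProfile backed by the Solr JDBC driver (the collection name and Doc mapping are illustrative):

```scala
import akka.actor.ActorSystem
import akka.stream.alpakka.slick.scaladsl.{Slick, SlickSession}
import akka.stream.scaladsl.Sink
import slick.jdbc.GetResult

implicit val system: ActorSystem = ActorSystem("solr-sql")

// "slick-solr" is an assumed config block whose profile is a custom
// slick.jdbc.JdbcProfile backed by the Solr JDBC driver.
implicit val session: SlickSession = SlickSession.forConfig("slick-solr")

case class Doc(id: String, title: String)
implicit val getDoc: GetResult[Doc] =
  GetResult(r => Doc(r.nextString(), r.nextString()))

import session.profile.api._

// Stream the SQL query results with backpressure end to end.
Slick
  .source(sql"SELECT id, title FROM collection1".as[Doc])
  .runWith(Sink.foreach(println))
```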

How to use NATS Streaming Server with Apache flink?

I want to use NATS Streaming Server to stream data and process it with Flink.
How can I use Apache Flink to process real-time streaming data from NATS Streaming Server?
You'll need to either find or develop a Flink/NATS connector, or mirror the data into some other stream storage service that already has Flink support. There is no NATS connector among the connectors that ship with Flink, in Apache Bahir, or in the collection of Flink community packages. But if you search around, you will find some relevant projects on GitHub, etc.
When evaluating a connector implementation, in addition to the usual considerations, consider these factors:
does it provide both consumer and producer interfaces?
does it do checkpointing?
what processing guarantees does it provide? (at least once, exactly once)
how good is the error handling?
performance: e.g., is it somehow batching writes?
how does it handle serialization?
does it expose any metrics?
If you decide to write your own connector, there are existing connectors for similar systems you can use as a reference, e.g., NiFi, Pulsar, etc. And you should be aware that the interfaces used by data sources are currently being refactored under the umbrella of FLIP-27.
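To give a feel for the effort involved, here is a bare-bones sketch of a hand-rolled source using the pre-FLIP-27 SourceFunction interface and the core NATS Java client (io.nats:jnats; note the NATS Streaming client has a different API). It does no checkpointing, so it offers no processing guarantees as written:

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

import io.nats.client.{Connection, Message, MessageHandler, Nats}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.RichSourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext

// Sketch only: core NATS client, no checkpoint integration (at-most-once).
class NatsSource(url: String, subject: String) extends RichSourceFunction[String] {
  @volatile private var running = true
  @transient private var conn: Connection = _
  @transient private var queue: LinkedBlockingQueue[String] = _

  override def open(parameters: Configuration): Unit = {
    queue = new LinkedBlockingQueue[String]()
    conn = Nats.connect(url)
    val dispatcher = conn.createDispatcher(new MessageHandler {
      override def onMessage(msg: Message): Unit =
        queue.put(new String(msg.getData, "UTF-8")) // hand off to the Flink thread
    })
    dispatcher.subscribe(subject)
  }

  override def run(ctx: SourceContext[String]): Unit =
    while (running) {
      val msg = queue.poll(100, TimeUnit.MILLISECONDS)
      if (msg != null) ctx.getCheckpointLock.synchronized(ctx.collect(msg))
    }

  override def cancel(): Unit = {
    running = false
    if (conn != null) conn.close()
  }
}
```

You would then register it with something like env.addSource(new NatsSource("nats://localhost:4222", "events")), where the URL and subject are placeholders.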

Apache Nifi Site To Site Data Partitioning

I have a single output port in NiFi flow and I have a Flink job that's consuming data from this port using NiFi Site To Site protocol (Flink provides appropriate connector). The consumption is parallel - i.e. there are multiple Flink sources reading from the same NiFi port.
What I would like to achieve is kind of partitioned data load balancing between running Flink sources - i.e. ensure that data with the same key is always delivered to the same Flink source (similar to ActiveMQ message groups or Kafka partitioning). This is needed for ordering purposes.
Unfortunately, I was unable to find any documentation telling how to accomplish that.
Any suggestions really appreciated.
Thanks in advance,
Site-to-site wasn't really made to do what you are asking for. The best way to achieve it would be for NiFi to publish to Kafka, and then have Flink consume from Kafka.
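A sketch of the Flink side of that NiFi -> Kafka -> Flink route (the topic name, broker address, and key extraction are assumptions; the NiFi-side keying would be configured on the publishing processor so that same-key records land in the same partition):

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

val env = StreamExecutionEnvironment.getExecutionEnvironment

val props = new Properties()
props.setProperty("bootstrap.servers", "kafka:9092") // assumed broker address
props.setProperty("group.id", "flink-consumer")

env
  .addSource(new FlinkKafkaConsumer[String]("nifi-events", new SimpleStringSchema(), props))
  // If NiFi keys the Kafka records consistently, same-key records land in
  // the same partition, and keyBy here restores per-key ordering in Flink.
  .keyBy(record => record.split(",")(0)) // key extraction is illustrative
  .print()

env.execute("nifi-kafka-flink")
```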

Spark streaming and database caching

I have a Spark Streaming application which does a lot of database access. We are using an abstraction layer (basically a Java library) which maintains a cache for database access. This client-side cache layer helps speed up query processing.
I know that in the context of a normal (non-Spark) host this really helps. But when it comes to Spark Streaming, I doubt whether this client-side caching would help, as data may not remain cached for long given how differently Spark Streaming behaves.
Can somebody please clarify?
