Basic Flink streaming question as far as data egress is concerned

I am currently working on a streaming platform that ingests an unbounded stream from a source into Kafka. I am using Flink as my stream processing engine. I am able to ingest data successfully, window it on event time, and do whatever I want to do in Flink. The output of this stream currently goes into a Kafka sink, which is ok for now since this data will not be streamed anywhere else. This entire setup is deployed on AWS.
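For reference, here is a minimal sketch of the current job's shape (broker address, topic names, and the windowing step are placeholders, not the real configuration):

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class EgressPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder address

        // Ingest the unbounded stream from Kafka.
        DataStream<String> input = env.addSource(
                new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

        // ... event-time windowing and the rest of the job logic go here ...

        // Results currently land in a Kafka sink, as described above.
        input.addSink(new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), props));

        env.execute("streaming-egress-sketch");
    }
}
```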
An external client is now interested in the data. The client wants the data in a streaming format instead of pulling it from Kafka, and we do not want to expose our Kafka brokers to the outside world. How can we achieve this? I tried the Pushpin proxy to "push" the data out, but it is a pain to set up and manage.
Any ideas on how to approach this? I am really open to suggestions.
Thanks

Related

Do websockets work with any data source such as DB2?

I'm starting to learn websockets and I would like to know whether they're supported by a database like DB2 (or some other data source).
Say I have a Spring Boot application that provides data to a UI as a service. Typically, I would run SQL SELECT statements every few seconds from the Java application. However, I want a stream of data from the table (or perhaps a stream of just the changes made to the table), similar to having an open websocket connection to a Kafka topic.
Is it possible to use something like a STOMP websocket to open a connection to a DB2 table that stays open and continuously pulls data? Does the data source have to support websockets for that to work?
No, they do not. RDBMS client-server protocols are more involved than simply streaming a load of bytes for the client to interpret.
Having said that, database connections are already persistent, duplex, and stateful, and were so long before the WebSocket protocol was conceived.
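If you want the UI to see a stream anyway, the usual workaround is to keep the periodic SELECT on the server side and relay the results over STOMP. A minimal Spring sketch, assuming @EnableScheduling and a STOMP message broker are already configured, and with the table name, interval, and destination all made up:

```java
import java.util.List;
import java.util.Map;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.messaging.simp.SimpMessagingTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Polls the table on a fixed interval and pushes rows to a STOMP topic,
// so browsers see a "stream" even though the database itself is only polled.
@Component
public class TablePoller {

    private final JdbcTemplate jdbc;
    private final SimpMessagingTemplate broker;

    public TablePoller(JdbcTemplate jdbc, SimpMessagingTemplate broker) {
        this.jdbc = jdbc;
        this.broker = broker;
    }

    @Scheduled(fixedDelay = 5000) // every 5 seconds, as in the polling setup above
    public void poll() {
        List<Map<String, Object>> rows =
                jdbc.queryForList("SELECT * FROM MY_TABLE"); // MY_TABLE is a placeholder
        broker.convertAndSend("/topic/table-updates", rows);
    }
}
```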

Apache Nifi Site To Site Data Partitioning

I have a single output port in a NiFi flow and a Flink job that consumes data from this port using the NiFi Site To Site protocol (Flink provides an appropriate connector). The consumption is parallel - i.e. there are multiple Flink sources reading from the same NiFi port.
What I would like to achieve is a kind of partitioned load balancing between the running Flink sources - i.e. ensure that data with the same key is always delivered to the same Flink source (similar to ActiveMQ message groups or Kafka partitioning). This is needed for ordering purposes.
Unfortunately, I was unable to find any documentation telling how to accomplish that.
Any suggestions really appreciated.
Thanks in advance,
Site-to-site wasn't really designed to do what you are asking for. The best way to achieve it would be for NiFi to publish to Kafka, and then have Flink consume from Kafka.
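On the Flink side, that hand-off could look roughly like this, assuming NiFi publishes each record with its partitioning key as the Kafka message key so that same-key records land in the same partition and arrive in order; the topic, brokers, and key-extraction logic are placeholders:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class NifiKafkaHandoff {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder address
        props.setProperty("group.id", "flink-consumers");

        // Each Flink source instance is assigned a subset of the Kafka
        // partitions, so per-key ordering from the partitioning is preserved.
        DataStream<String> records = env.addSource(
                new FlinkKafkaConsumer<>("nifi-output", new SimpleStringSchema(), props));

        // keyBy keeps per-key ordering across downstream Flink operators.
        records.keyBy(r -> r.split(",")[0]) // hypothetical: key is the first CSV field
               .print();

        env.execute("nifi-kafka-handoff");
    }
}
```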

Does akka-streams support clustering? If yes, please share an example

I am using Akka Streams in my application to handle real-time data.
The data volume is very high, so I want to scale my application horizontally.
Can someone help me understand whether akka-streams supports clustering? If yes, please share an example.
I have found no examples or documentation that would indicate akka-stream is capable of directly running in a clustered mode. However, there are a few "work-arounds" that you may be able to deploy.
Integration with Akka Cluster
You could have a Flow.map or Flow.mapAsync send an incoming object to the cluster and wait for the response.
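For example, a rough sketch of that approach, assuming clusterProxy is an ActorRef for something cluster-aware such as a cluster router or shard region; the parallelism and timeout are arbitrary:

```java
import java.time.Duration;
import java.util.concurrent.CompletionStage;

import akka.NotUsed;
import akka.actor.ActorRef;
import akka.pattern.Patterns;
import akka.stream.javadsl.Flow;

public class ClusterHandoff {
    // Farms each element out to a (possibly remote) actor via ask and waits
    // for the reply, keeping up to 8 asks in flight at a time.
    public static Flow<String, Object, NotUsed> toCluster(ActorRef clusterProxy) {
        return Flow.of(String.class)
                .mapAsync(8, msg -> {
                    CompletionStage<Object> reply =
                            Patterns.ask(clusterProxy, msg, Duration.ofSeconds(5));
                    return reply; // the actor's response becomes the stream element
                });
    }
}
```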
Sharding of Incoming Data
The source data could be broken up by a sharding function and sent to independent services which process the data in parallel. Each service could operate with a single akka-stream but the cluster of applications would allow for multiple streams running independently.

API source support in Flink

Does Flink provide a way to poll an API periodically and create a DataStream object out of it for further processing?
We currently push the messages to Kafka and read them from Kafka. Is there any way to poll the API directly from Flink?
I'm not aware of such a source connector for Flink, but it would be relatively straightforward to implement one. There are examples out there that do just this but with a database query; one of those might serve as a template for getting started.
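As a starting point, here is a sketch of such a source that polls a hypothetical HTTP endpoint on a fixed interval; the URL, interval, and plain-text response handling are all assumptions:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.stream.Collectors;

import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Polls an HTTP API periodically and emits each response body as a stream element.
public class PollingHttpSource implements SourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("https://example.com/api").openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String body = in.lines().collect(Collectors.joining("\n"));
                synchronized (ctx.getCheckpointLock()) { // emit under the checkpoint lock
                    ctx.collect(body);
                }
            }
            Thread.sleep(10_000); // poll every 10 seconds
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```

You would then attach it with env.addSource(new PollingHttpSource()) and process the resulting DataStream as usual.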

Spark streaming and database caching

I have a Spark streaming application which does a lot of database access. We are using an abstraction layer (basically a Java library) which maintains a cache for database access. This cache layer (on the client side) helps speed up query processing.
I know that in the context of a normal (non-Spark) host this really helps. But when it comes to Spark Streaming, I doubt whether this client-side caching would help, since data may not remain cached for long because Spark Streaming behaves differently.
Can somebody please clarify?
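To make the concern concrete: whether the cache helps depends on where the cached state lives. One common pattern is to hold a single client per executor JVM, so the cache survives across micro-batches for as long as the executor does. A hypothetical sketch, with CachingDbClient standing in for the abstraction layer described above:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;

// Stand-in for the caching abstraction layer described above (hypothetical).
class CachingDbClient {
    private static CachingDbClient instance;

    // One client (and therefore one cache) per executor JVM; it survives
    // across micro-batches for as long as the executor stays alive.
    static synchronized CachingDbClient getOrCreate() {
        if (instance == null) {
            instance = new CachingDbClient();
        }
        return instance;
    }

    String lookup(String key) { /* cached database access */ return key; }
}

public class StreamingJob {
    static void wire(JavaDStream<String> stream) {
        stream.foreachRDD((JavaRDD<String> rdd) ->
                rdd.foreachPartition(partition -> {
                    // Runs on the executor: reuses the JVM-wide cached client.
                    CachingDbClient client = CachingDbClient.getOrCreate();
                    partition.forEachRemaining(client::lookup);
                }));
    }
}
```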
