Apache NiFi Site-to-Site Data Partitioning - apache-flink

I have a single output port in a NiFi flow, and I have a Flink job that consumes data from this port using the NiFi Site-to-Site protocol (Flink provides an appropriate connector). The consumption is parallel, i.e. there are multiple Flink sources reading from the same NiFi port.
What I would like to achieve is a kind of partitioned load balancing between the running Flink sources, i.e. ensure that data with the same key is always delivered to the same Flink source (similar to ActiveMQ message groups or Kafka partitioning). This is needed for ordering purposes.
Unfortunately, I was unable to find any documentation explaining how to accomplish that.
Any suggestions are really appreciated.
Thanks in advance,

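Site-to-Site wasn't really made to do what you are asking for. The best way to achieve it would be for NiFi to publish to Kafka and then have Flink consume from Kafka, since Kafka's key-based partitioning keeps records with the same key in the same partition and therefore in order.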
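To illustrate why key-based partitioning preserves per-key ordering, here is a minimal sketch in plain Python. It is not Kafka's or NiFi's actual implementation: `partition_for` and `route` are hypothetical helpers, and CRC32 is only a stand-in for the hash (Kafka's default partitioner uses murmur2 for keyed records). A stable hash is used rather than Python's built-in `hash()`, which is randomized per process for strings.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(records, num_partitions):
    """Append each (key, value) record to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical sample data: three keys, with values arriving in order per key.
records = [
    ("sensor-a", 1),
    ("sensor-b", 1),
    ("sensor-a", 2),
    ("sensor-c", 1),
    ("sensor-a", 3),
]
partitions = route(records, 3)

# Every record for "sensor-a" lands in the same partition, in arrival order.
target = partition_for("sensor-a", 3)
print(target, [v for k, v in partitions[target] if k == "sensor-a"])
```

With one consumer per partition, each key is always read by the same consumer, which is the property the question asks for.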

Related

Apache Flink sinking its results into OpenSearch

I am running Apache Flink v1.14 on a server, where it does some pre-processing on the data that it reads from Kafka. I need it to write the results to OpenSearch, after which I can fetch the results from OpenSearch.
However, when going through the list of Flink v1.14 connectors, I don't see OpenSearch. Is there any other way I can implement it?
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/overview/
In the above link, I see only Elasticsearch, no OpenSearch.
I think the OpenSearch sink has been added in Flink 1.16, so you may consider updating your cluster. Otherwise, you may need to port the changes to 1.14 (which shouldn't be hard at all) and ship them as a custom library.

Understanding the difference between Apache Camel and Kafka Streams

Being quite familiar with Apache Camel, I am a newbie in Kafka Streams. I am learning Kafka Streams, but could not find a relevant answer to the query below.
Being libraries, both Camel and Kafka Streams can create pipelines that extract data, polish/transform it, and load it into some sink using a processor. Camel also supports stream processing. I want to understand:
the difference between these two, since I feel the Camel library is more generic than Kafka Streams, which is not relevant for systems where there is no Kafka broker (not sure if this is wrong)
which library is recommended for which type of use case
Thanks in advance.
Kafka Streams is a stream processing framework that consumes messages from Kafka topics and writes them back to other Kafka topics. It supports stateful transformations, such as aggregations into tables, leveraging RocksDB when necessary. You can expose such tables/stores through REST endpoints, but that already goes beyond the built-in Kafka Streams features.
Another possible extension is to send messages somewhere other than Kafka; you will have to provide the client for that yourself. In that regard, Kafka Streams is much less versatile than Apache Camel. Because of that specialisation, it supports various Kafka-specific features, such as parallel processing based on Kafka consumer groups, predefined message envelopes, and exactly-once semantics. One of the most important features is the support for "stream time" in Kafka Streams, which allows messages to be reprocessed by their Kafka timestamps regardless of wall-clock time.
You can have a look at KSQL, which is built on top of Kafka Streams, to get an idea of what can be built with Kafka Streams.
In short, if you have data in Kafka that you want to process and write back to Kafka for other programs to consume, Kafka Streams is a very helpful framework. It even has a deployment model similar to Apache Camel's. However, if you need to integrate different technologies with Kafka, you need to stay with Apache Camel. Note that there is also Kafka Connect in the Apache Kafka family, which is geared towards integrating data from other systems with Apache Kafka.
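The "stream time" idea mentioned above can be sketched in a few lines of plain Python. This is not Kafka Streams code: `StreamTimeCounter`, `window_start`, and the one-minute window size are hypothetical names and values chosen for illustration. The point is only that time advances with the timestamps carried by the records, not with the wall clock, so replaying old data produces the same windows. Grace periods and late-record handling are deliberately left out.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # hypothetical one-minute tumbling window


def window_start(timestamp_ms: int, size_ms: int = WINDOW_MS) -> int:
    """Return the start of the tumbling window that contains the timestamp."""
    return timestamp_ms - (timestamp_ms % size_ms)


class StreamTimeCounter:
    """Count records per (key, window) using record timestamps only."""

    def __init__(self):
        self.stream_time = -1  # highest record timestamp observed so far
        self.counts = defaultdict(int)

    def process(self, key: str, timestamp_ms: int) -> None:
        # Stream time only moves forward, and only when a record arrives.
        self.stream_time = max(self.stream_time, timestamp_ms)
        self.counts[(key, window_start(timestamp_ms))] += 1


counter = StreamTimeCounter()
# The last record is out of order; it is still assigned by its own timestamp.
for key, ts in [("a", 5_000), ("a", 59_000), ("a", 61_000), ("a", 30_000)]:
    counter.process(key, ts)

print(counter.stream_time, dict(counter.counts))
```

Nothing in the sketch reads the system clock, which is why the result is the same whether the records are processed live or replayed later.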

Feasible Streaming Suggestions | Is it possible to use Apache Nifi + Apache Beam (on Flink Cluster) with Real Time Streaming Data

So, I am very new to all the Apache frameworks I am trying to use. I want your suggestions on a couple of workflow designs for an IoT streaming application:
We have NiFi connectors available for Flink, and we can easily use the Beam abstraction over Flink. Can I use NiFi as the dataflow tool to drive data from MiNiFi to the Flink cluster (storing it there in memory or something similar) and then use a Beam pipeline to process the data further?
Is there any NiFi connector for Beam? If not, can we build one, so that we stream data directly from NiFi to the Beam job (running on a Flink cluster)?
I am still in the early design phase, so it would be great if we could discuss possible workarounds. Let me know if you need any other details.

How to use NATS Streaming Server with Apache Flink?

I want to use NATS Streaming Server to stream data and use Flink to process that data.
How can I use Apache Flink to process real-time streaming data with NATS Streaming Server?
You'll need to either find or develop a Flink/NATS connector, or mirror the data into some other stream storage service that already has Flink support. There is no NATS connector among the connectors that are part of Flink, or Apache Bahir, or in the collection of Flink community packages. But if you search around, you will find some relevant projects on GitHub, etc.
When evaluating a connector implementation, in addition to the usual considerations, consider these factors:
does it provide both consumer and producer interfaces?
does it do checkpointing?
what processing guarantees does it provide? (at least once, exactly once)
how good is the error handling?
performance: e.g., is it somehow batching writes?
how does it handle serialization?
does it expose any metrics?
If you decide to write your own connector, there are existing connectors for similar systems you can use as a reference, e.g., NiFi, Pulsar, etc. You should also be aware that the interfaces used by data sources are currently being refactored under the umbrella of FLIP-27.
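The checkpointing and processing-guarantee questions in the checklist above can be made concrete with a small plain-Python sketch. It does not use Flink's connector interfaces: `CheckpointedSource`, `snapshot_state`, and `restore_state` are hypothetical names, and the in-memory list stands in for a replayable stream. It shows how a source that checkpoints its read offset and rewinds to it after a failure ends up with at-least-once delivery.

```python
class CheckpointedSource:
    """Toy source that reads from a replayable list and checkpoints its offset."""

    def __init__(self, messages):
        self.messages = messages
        self.offset = 0  # index of the next message to emit

    def poll(self):
        """Return the next message, or None when the input is exhausted."""
        if self.offset >= len(self.messages):
            return None
        message = self.messages[self.offset]
        self.offset += 1
        return message

    def snapshot_state(self):
        """Capture the read position so it can be stored in a checkpoint."""
        return {"offset": self.offset}

    def restore_state(self, state):
        """Rewind to the read position stored in a checkpoint."""
        self.offset = state["offset"]


source = CheckpointedSource(["m0", "m1", "m2", "m3"])
seen = [source.poll(), source.poll()]
checkpoint = source.snapshot_state()  # taken after m0 and m1
seen.append(source.poll())            # m2 is processed after the checkpoint

# Simulated failure: recover from the last checkpoint and keep reading.
source.restore_state(checkpoint)
replayed = []
message = source.poll()
while message is not None:
    replayed.append(message)
    message = source.poll()

print(seen, replayed)
```

Here `m2` is emitted twice, once before the failure and once after recovery. Nothing is lost, but duplicates are possible; getting to exactly-once additionally requires the sink or the downstream state to take part in the checkpoint or to deduplicate.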

Basic Flink streaming question as far as data egress is concerned

I am currently working on a streaming platform that accepts an unbounded stream from a source into Kafka. I am using Flink as my stream processing engine. I am able to ingest data successfully, window it on event time and do whatever I want to do in Flink. The output of this stream currently goes into a Kafka sink, which is ok for now since this data will not be streamed anywhere. This entire setup is deployed on AWS.
An external client is now interested in the data. The client wants the data in a streaming format instead of pulling the data from Kafka. We also do not want to expose our Kafka brokers to the external world. How can we achieve this? I tried the Pushpin proxy to "push" the data out. However, it's a pain to set up and manage.
Any idea how to approach this? I am really open to any ideas.
Thanks
