Apache Flink sinking its results into OpenSearch - apache-flink

I am running Apache Flink v1.14 on a server that does some pre-processing on the data it reads from Kafka. I need it to write the results to OpenSearch, after which I can fetch the results from OpenSearch.
However, when going through the list of Flink v1.14 connectors, I don't see OpenSearch. Is there any other way I can implement it?
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/overview/
In the above link, I see only Elasticsearch, no OpenSearch.

I think the OpenSearch sink was added in Flink 1.16, so you may consider upgrading your cluster. Otherwise, you may need to port the changes to 1.14 (which shouldn't be hard at all) and publish them as a custom library.
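For reference, the dedicated connector (flink-connector-opensearch, released alongside Flink 1.16) exposes a builder-style API roughly like the sketch below; the host, index name, and document layout here are placeholders:

```java
import org.apache.flink.connector.opensearch.sink.OpensearchSinkBuilder;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.http.HttpHost;
import org.opensearch.action.index.IndexRequest;
import org.opensearch.client.Requests;

import java.util.HashMap;
import java.util.Map;

public class OpenSearchSinkExample {

    // Attaches the OpenSearch sink to an existing stream; the host and
    // index name below are illustrative.
    static void sinkToOpenSearch(DataStream<String> results) {
        results.sinkTo(
                new OpensearchSinkBuilder<String>()
                        .setHosts(new HttpHost("localhost", 9200, "http"))
                        .setBulkFlushMaxActions(1) // flush every record; raise for throughput
                        .setEmitter((element, context, indexer) ->
                                indexer.add(createIndexRequest(element)))
                        .build());
    }

    // Wraps each record in a simple one-field JSON document.
    private static IndexRequest createIndexRequest(String element) {
        Map<String, Object> json = new HashMap<>();
        json.put("data", element);
        return Requests.indexRequest()
                .index("my-index")
                .source(json);
    }
}
```

This is the shape of the 1.16 connector's API; a port to 1.14 would mostly mean rebuilding it against the 1.14 sink interfaces.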

Related

How to use NATS Streaming Server with Apache flink?

I want to use NATs streaming server to streaming data and using Flink want to process on data.
how I can use apache flink to process real-time streaming data with NATS streaming server?
You'll need to either find or develop a Flink/NATS connector, or mirror the data into some other stream storage service that already has Flink support. There is no NATS connector among the connectors that ship with Flink, with Apache Bahir, or in the collection of Flink community packages. But if you search around, you will find some relevant projects on GitHub, etc.
When evaluating a connector implementation, in addition to the usual considerations, consider these factors:
does it provide both consumer and producer interfaces?
does it do checkpointing?
what processing guarantees does it provide? (at least once, exactly once)
how good is the error handling?
performance: e.g., does it batch writes?
how does it handle serialization?
does it expose any metrics?
If you decide to write your own connector, there are existing connectors for similar systems you can use as a reference, e.g., NiFi, Pulsar, etc. And you should be aware that the interfaces used by data sources are currently being refactored under the umbrella of FLIP-27.
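If you do write one, a minimal source wrapping the official NATS Java client might look like the sketch below, using the pre-FLIP-27 SourceFunction interface. The URL and subject are placeholders, and note that without checkpoint integration this gives at-most-once delivery only:

```java
import io.nats.client.Connection;
import io.nats.client.Dispatcher;
import io.nats.client.Nats;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.nio.charset.StandardCharsets;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical NATS source; no checkpointing, so delivery is at-most-once.
public class NatsSource extends RichSourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        LinkedBlockingQueue<String> buffer = new LinkedBlockingQueue<>();
        try (Connection nc = Nats.connect("nats://localhost:4222")) {
            // The dispatcher pushes incoming messages into a local buffer...
            Dispatcher dispatcher = nc.createDispatcher(msg ->
                    buffer.offer(new String(msg.getData(), StandardCharsets.UTF_8)));
            dispatcher.subscribe("sensor.readings"); // placeholder subject
            // ...which the source loop drains into the Flink stream.
            while (running) {
                String msg = buffer.poll(100, TimeUnit.MILLISECONDS);
                if (msg != null) {
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(msg);
                    }
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```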

Apache Nifi Site To Site Data Partitioning

I have a single output port in NiFi flow and I have a Flink job that's consuming data from this port using NiFi Site To Site protocol (Flink provides appropriate connector). The consumption is parallel - i.e. there are multiple Flink sources reading from the same NiFi port.
What I would like to achieve is a kind of partitioned load balancing between the running Flink sources - i.e. ensure that data with the same key is always delivered to the same Flink source (similar to ActiveMQ message groups or Kafka partitioning). This is needed for ordering purposes.
Unfortunately, I was unable to find any documentation telling how to accomplish that.
Any suggestions really appreciated.
Thanks in advance,
Site-to-site wasn't really made to do what you are asking for. The best way to achieve it would be for NiFi to publish to Kafka, and then have Flink consume from Kafka.
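For example, if NiFi publishes to Kafka with your partitioning key as the record key, a Flink job along these lines preserves per-key ordering, since all records with the same key land in the same Kafka partition and hence the same Flink subtask (the topic name and key extraction below are illustrative):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NifiKafkaOrdering {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // "nifi-output" is a placeholder topic. If NiFi's PublishKafka processor
        // sets the record key, Kafka partitions by it, so each key sticks to
        // one partition.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("nifi-output")
                .setGroupId("flink-consumer")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "NiFi via Kafka")
                // Re-key inside Flink so downstream operators also see all
                // records for one key on the same subtask.
                .keyBy(record -> record.split(",")[0]) // illustrative key extraction
                .print();

        env.execute("nifi-kafka-ordering");
    }
}
```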

API source support in flink

Does Flink provide a way to poll an API periodically and create a DataStream object out of it for further processing?
We currently push the messages to Kafka and read through Kafka. Is there any way to poll the API directly through Flink?
I'm not aware of such a source connector for Flink, but it would be relatively straightforward to implement one. There are examples out there that do just this but with a database query; one of those might serve as a template for getting started.
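As a sketch of the idea (the endpoint URL and polling interval are made up), a SourceFunction can simply poll inside its run loop and emit each response as a stream element:

```java
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical polling source: fetches an HTTP API every 10 seconds and
// emits the response body. At-most-once; the poll position is not checkpointed.
public class HttpPollingSource extends RichSourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/api/data")) // placeholder URL
                .GET()
                .build();
        while (running) {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(response.body());
            }
            Thread.sleep(10_000); // poll interval
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```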

Using Apache Flink for data streaming

I am working on building an application with the requirements below, and I am just getting started with Flink.
Ingest data into Kafka with say 50 partitions (Incoming rate - 100,000 msgs/sec)
Read data from Kafka and process each data (Do some computation, compare with old data etc) real time
Store the output on Cassandra
I was looking for a real time streaming platform and found Flink to be a great fit for both real time and batch.
Do you think Flink is the best fit for my use case, or should I use Storm, Spark Streaming, or another streaming platform?
Do I need to write a data pipeline in Google Dataflow to execute my sequence of steps on Flink, or is there any other way to perform a sequence of steps for real-time streaming?
Say each of my computations takes around 20 milliseconds; how can I best design it with Flink to get better throughput?
Can I use Redis or Cassandra to get some data within Flink for each computation?
Will I be able to use a JVM in-memory cache inside Flink?
Also, can I aggregate data based on a key for some time window (for example, 5 seconds)? For example, let's say 100 messages come in and 10 of them share the same key; can I group all messages with the same key together and process them?
Are there any tutorials on best practices using flink?
Thanks and appreciate all your help.
Given your task description, Apache Flink looks like a good fit for your use case.
In general, Flink provides low latency and high throughput and has a parameter to tune the trade-off between the two. You can read and write data from and to Redis or Cassandra. However, you can also store state internally in Flink. Flink also has sophisticated support for windows. You can read the blog on the Flink website, check out the documentation for more information, or follow a Flink training to learn the API.
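On the windowing question specifically, a keyed 5-second tumbling window looks roughly like this (the Event type and the sum-style merge are stand-ins for your message schema and computation):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedAggregation {

    // Illustrative message type: a key plus a numeric payload.
    public static class Event {
        public String key;
        public long value;

        public Event() {}

        public Event(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    // Groups events by key over 5-second tumbling windows and folds each
    // group into a single result per key and window (here: summing values).
    static DataStream<Event> aggregatePerKey(DataStream<Event> events) {
        return events
                .keyBy(e -> e.key)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .reduce((a, b) -> new Event(a.key, a.value + b.value));
    }
}
```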

Querying Data from Apache Flink

I am looking to migrate from a homegrown streaming server to Apache Flink. One thing that we have is an Apache Storm-like DRPC interface to run queries against the state held in the processing topology.
So, for example: I have a bunch of sensors that I am running a moving average on. I want to run a query against the topology and return all the sensors where that average is above a fixed value.
Is there an equivalent in Flink, or if not, what is the best way to achieve equivalent functionality?
Out of the box, Flink does not come with a solution for querying the internal state of operators right now. You're in luck, however, because there are two options: We built a stateful word count example that allows querying the state. It is available here: https://github.com/dataArtisans/query-window-example
For one of the upcoming versions of Flink we are also working on a generic solution to the queryable state use case. This will allow querying the state of any internal operation.
Alternatively, it might suffice in your case to periodically emit the values to something like Elasticsearch using a window operation; the results could then simply be queried from Elasticsearch.
An out-of-the-box solution called Queryable State is coming in the next release.
Here is an example:
https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/query/QueryableStateITCase.java
But I suggest reading up on it first, then looking at the example.
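For a flavor of what it looks like, here is a sketch along the lines of the later (Flink 1.4+) client API; the state names, job ID, and proxy host/port are all placeholders, and the API shifted a bit between releases. On the job side you mark a keyed state descriptor as queryable, and on the client side you look it up by job ID and key:

```java
import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.queryablestate.client.QueryableStateClient;

public class MovingAverageQuery {

    // Job side: a descriptor for the keyed moving-average state, marked
    // queryable under an external name ("moving-average-query" is made up).
    static ValueStateDescriptor<Double> queryableDescriptor() {
        ValueStateDescriptor<Double> descriptor =
                new ValueStateDescriptor<>("movingAverage", Types.DOUBLE);
        descriptor.setQueryable("moving-average-query");
        return descriptor;
    }

    // Client side: look up the current value for one sensor key.
    static void querySensor(String jobIdHex, String sensorKey) throws Exception {
        QueryableStateClient client =
                new QueryableStateClient("taskmanager-host", 9069); // proxy host/port
        client.getKvState(
                        JobID.fromHexString(jobIdHex),
                        "moving-average-query",
                        sensorKey,
                        Types.STRING,
                        queryableDescriptor())
                .thenAccept(state -> {
                    try {
                        System.out.println(sensorKey + " -> " + state.value());
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
    }
}
```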
