Flink support for Hadoop 3.x - apache-flink

I am using Flink 1.13.1 and we plan to ingest FTP-based sources. Since Flink depends on Hadoop dependencies to support such operations, I need to understand if Flink supports Hadoop 3.x. The documentation doesn't mention anything about this.

Related

Apache Flink sinking its results into OpenSearch

I am running Apache Flink v1.14 on the server, which does some pre-processing on the data that it reads from Kafka. I need it to write the results to OpenSearch, after which I can fetch the results from OpenSearch.
However, when going through the list of Flink v1.14 connectors, I don't see OpenSearch. Is there any other way I can implement it?
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/overview/
In the above link, I see only ElasticSearch, no OpenSearch
I think the OpenSearch sink was added in Flink 1.16, so you may consider upgrading your cluster. Otherwise, you may need to port the changes to 1.14 (which shouldn't be hard at all) and publish them as a custom library.
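For reference, here is a rough sketch of what the DataStream sink could look like once you are on Flink 1.16+ with the flink-connector-opensearch dependency. The host, index name, and "data" field are placeholders, and I'm assuming the builder/emitter API mirrors the Elasticsearch connector, so double-check it against the connector docs:

```java
import org.apache.flink.connector.opensearch.sink.OpensearchSinkBuilder;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.http.HttpHost;
import org.opensearch.action.index.IndexRequest;

import java.util.Map;

// Sketch of an OpenSearch sink on Flink 1.16+ with flink-connector-opensearch.
// Host, index name and field are placeholders; verify the builder API against the docs.
public class OpenSearchSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the pre-processed stream coming out of the Kafka pipeline.
        DataStream<String> preprocessed = env.fromElements("record-1", "record-2");

        preprocessed.sinkTo(
                new OpensearchSinkBuilder<String>()
                        .setHosts(new HttpHost("localhost", 9200, "http"))
                        .setBulkFlushMaxActions(100) // flush after 100 buffered requests
                        .setEmitter((element, context, indexer) ->
                                indexer.add(new IndexRequest("my-index")
                                        .source(Map.of("data", element))))
                        .build());

        env.execute("opensearch-sink-example");
    }
}
```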

Feasible Streaming Suggestions | Is it possible to use Apache NiFi + Apache Beam (on a Flink Cluster) with Real-Time Streaming Data

So, I am very, very new to all the Apache frameworks I am trying to use. I want your suggestions on a couple of workflow designs for an IoT streaming application:
We have NiFi connectors available for Flink, and we can easily use the Beam abstraction over Flink. Can I use NiFi as the dataflow tool to drive data from MiNiFi to a Flink cluster (storing it in-memory or something there), and then use a Beam pipeline to process the data further?
Is there any NiFi connector for Beam? If not, can we build one, so that we stream data directly from NiFi to the Beam job (running on a Flink cluster)?
I am still in the early design phase, so it would be great if we could discuss possible workarounds. Let me know if you need any other details.

How to use NATS Streaming Server with Apache Flink?

I want to use the NATS Streaming Server to stream data and use Flink to process it.
How can I use Apache Flink to process real-time streaming data coming from the NATS Streaming Server?
You'll need to either find or develop a Flink/NATS connector, or mirror the data into some other stream storage service that already has Flink support. There is no NATS connector among the connectors that ship with Flink, Apache Bahir, or the collection of Flink community packages. But if you search around, you will find some relevant projects on GitHub and elsewhere.
When evaluating a connector implementation, in addition to the usual considerations, consider these factors:
does it provide both consumer and producer interfaces?
does it do checkpointing?
what processing guarantees does it provide? (at-least-once, exactly-once)
how good is the error handling?
performance: e.g., does it batch writes?
how does it handle serialization?
does it expose any metrics?
If you decide to write your own connector, there are existing connectors for similar systems you can use as a reference, e.g., NiFi, Pulsar, etc. And you should be aware that the interfaces used by data sources are currently being refactored under the umbrella of FLIP-27.
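To make the shape of such a connector concrete, here is a rough sketch of a NATS source built on the legacy (pre-FLIP-27) SourceFunction API using the jnats client. The URL, subject, and the choice to emit plain strings are placeholders, and this version checkpoints nothing, so delivery is at-most-once -- treat it as a starting point, not a finished connector:

```java
import io.nats.client.Connection;
import io.nats.client.Message;
import io.nats.client.Nats;
import io.nats.client.Subscription;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.nio.charset.StandardCharsets;
import java.time.Duration;

// Sketch of a NATS source using the legacy SourceFunction API (pre-FLIP-27).
// No NATS position is checkpointed, so delivery is at-most-once.
public class NatsSource implements SourceFunction<String> {
    private final String natsUrl;   // e.g. "nats://localhost:4222" (placeholder)
    private final String subject;   // NATS subject to subscribe to (placeholder)
    private volatile boolean running = true;

    public NatsSource(String natsUrl, String subject) {
        this.natsUrl = natsUrl;
        this.subject = subject;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        Connection nc = Nats.connect(natsUrl);
        try {
            Subscription sub = nc.subscribe(subject);
            while (running) {
                Message msg = sub.nextMessage(Duration.ofSeconds(1)); // poll with a timeout
                if (msg != null) {
                    // Emit under the checkpoint lock so records don't interleave with checkpoints.
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(new String(msg.getData(), StandardCharsets.UTF_8));
                    }
                }
            }
        } finally {
            nc.close();
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```

Usage would then be something like env.addSource(new NatsSource("nats://localhost:4222", "events")); working through the checklist above (checkpointing, exactly-once, metrics, error handling) is where the real effort lies.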

Where is the JobManager on embedded Flink instances?

I am developing an application with multiple (micro)services.
I am using Flink (over Kafka) to stream messages between the services. Flink is embedded in the Java applications, each running in a separate Docker container.
This is the first time I'm trying Flink and after reading the docs I still have a feeling I'm missing something basic.
Who is managing the jobs?
Where is the JobManager running?
How do I monitor the processing?
Thanks,
Moshe
I would recommend this talk by Stephan Ewen at Flink Forward 2016. It explains the current Apache Flink architecture (10:45) for different deployments as well as future goals.
In general, the JobManager manages Flink jobs, while TaskManagers execute your job, which consists of multiple tasks. How the components are orchestrated depends on your deployment (local, standalone Flink cluster, YARN, Mesos, etc.).
The best tool for monitoring your processing is the Flink Web UI (on port 8081 by default); it offers various metrics for debugging and monitoring (e.g., checkpointing or back-pressure).
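For the embedded case specifically, here is a minimal sketch (the class name and the tiny pipeline are just placeholders): when Flink runs inside your Java application, a local environment starts an in-process mini cluster, so the JobManager and a TaskManager live inside that JVM, and you can expose the Web UI explicitly to monitor it.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Minimal sketch: an embedded ("local") Flink environment starts an in-process mini
// cluster, so the JobManager and TaskManager run inside this JVM. Enabling the local
// Web UI lets you monitor the job on port 8081, just like on a standalone cluster.
public class EmbeddedFlinkExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInteger(RestOptions.PORT, 8081); // Web UI port for the embedded cluster

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);

        env.fromElements("a", "b", "c")   // placeholder pipeline
           .map(String::toUpperCase)
           .print();

        env.execute("embedded-example");
    }
}
```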

Nutch and save crawl data to Amazon S3

I am trying to evaluate if Nutch/Solr/Hadoop are the right technologies for my task.
PS: Previously I was trying to integrate Nutch (1.4) and Hadoop to see how it works.
Here is what I am trying to achieve overall,
a) Start with seed URL(s) and crawl and parse/save data/links
-- which the Nutch crawler does anyway.
b) Then be able to query the crawled indexes from a Java client
--- (maybe using the SolrJ client)
c) Since Nutch (as of 1.4.x) already uses Hadoop internally, I will just install Hadoop and configure it in the nutch-**.xml files.
d) I would like Nutch to save the crawled indexes to Amazon S3 and also have Hadoop use S3 as its file system.
Is this even possible? Or even worth it?
e) I read in one of the forums that in Nutch 2.0 there is a data layer using Gora that can save indexes to HBase, etc. I don't know when the 2.0 release is due. :-(
Would anyone suggest grabbing the in-progress 2.0 trunk and starting to use it, hoping to get a released lib sooner or later?
PS: I am still trying to figure out how/when/why/where Nutch uses Hadoop internally. I just cannot find any written documentation or tutorials. Any help on this aspect is also much appreciated.
If you are reading this line, then thank you so much for reading this post up to this point :-)
Hadoop can use S3 as its underlying file system natively. I have had very good results with this approach when running Hadoop in EC2, either using EMR or your own / third-party Hadoop AMIs. I would not recommend using S3 as the underlying file system when using Hadoop outside of EC2, as bandwidth limitations would likely negate any performance gains Hadoop would give you. The S3 adapter for Hadoop was developed by Amazon and is part of the Hadoop core. Hadoop treats S3 just like HDFS. See http://wiki.apache.org/hadoop/AmazonS3 for more info on using Hadoop with S3.
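As a rough illustration of that setup (the bucket name and credentials are placeholders, and the property names follow the old s3n adapter described on that wiki page, so verify them against your Hadoop version; normally you would set them in core-site.xml rather than in code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: pointing Hadoop at S3 via the s3n adapter from that era. Bucket name and
// credentials are placeholders; these properties would usually live in core-site.xml.
public class S3AsHadoopFileSystem {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "s3n://my-crawl-bucket");    // use S3 instead of HDFS
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        FileSystem fs = FileSystem.get(conf);
        // Hadoop (and therefore Nutch in "deploy" mode) now treats the bucket like HDFS,
        // e.g. listing the crawl output directory:
        for (FileStatus status : fs.listStatus(new Path("/crawl"))) {
            System.out.println(status.getPath());
        }
    }
}
```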
Nutch is designed to run as a job on a Hadoop cluster (when in "deploy" mode) and therefore does not include the Hadoop jars in its distribution. Because it runs as a Hadoop job, however, it can access any underlying data store that Hadoop supports, such as HDFS or S3. When run in "local" mode, you will provide your own local Hadoop installation. Once crawling is finished in "deploy" mode, the data will be stored in the distributed file system. It is recommended that you wait for indexing to finish and then download the index to a local machine for searching, rather than searching in the DFS, for performance reasons. For more on using Nutch with Hadoop, see http://wiki.apache.org/nutch/NutchHadoopTutorial.
Regarding HBase, I have had good experiences using it, although not for your particular use case. I can imagine that for random searches, Solr may be faster and more feature-rich than HBase, but this is debatable. HBase is probably worth a try. Until 2.0 comes out, you may want to write your own Nutch-to-HBase connector or simply stick with Solr for now.
