What should be excluded from an Apache Flink Job Jar? - apache-flink

When packaging a Flink job that uses, for example, some Flink connectors and some third-party libraries (for processing), which dependencies should end up in the job's jar so it can be launched in a Flink cluster using "flink run [jarfile]"?
Is making a fat-jar the desired approach?
If writing a job in Scala, do you include the Scala standard library in the jar?
I didn't find any documentation on how to package a job for flink once it is written.

Yes, a fat-jar is the standard way to package a Flink job. Anything already contained in the Flink distribution must not be included (i.e., the Java and Scala standard libraries, Flink core, ...). Only Flink libraries that are not part of the distribution (such as connectors), plus user-defined external dependencies, must be included in the fat-jar.
You can follow this guideline from the Flink documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/cluster_execution.html#linking-with-modules-not-contained-in-the-binary-distribution
This might also be helpful: https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/common/index.html#program-packaging-and-distributed-execution
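One way to sanity-check a built fat-jar against the rule above is to list its entries and flag classes that normally ship with the Flink distribution. The following is a minimal Python sketch, not an official Flink tool. The package prefixes in PROVIDED_PREFIXES are assumptions chosen for illustration (Flink runtime and core classes, and the Scala standard library) and are not an exhaustive list.

```python
import zipfile

# Assumed, non-exhaustive prefixes of classes that the Flink distribution
# already provides; adjust them to match your Flink version.
PROVIDED_PREFIXES = (
    "org/apache/flink/runtime/",
    "org/apache/flink/core/",
    "scala/",
)


def find_provided_classes(jar_path_or_file):
    """Return the .class entries in a jar that fall under PROVIDED_PREFIXES."""
    with zipfile.ZipFile(jar_path_or_file) as jar:
        return sorted(
            name
            for name in jar.namelist()
            if name.endswith(".class") and name.startswith(PROVIDED_PREFIXES)
        )
```

Calling find_provided_classes("target/my-job.jar") (a hypothetical path) returns the entries that probably should have been excluded; an empty list means none of the assumed prefixes were found. Connector classes, which live under other packages, are not flagged.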

Related

Flink 1.13.2 does not update metrics in near-real-time when connected to Kafka sources/sinks

I'm creating a process to handle millions of records with Apache Flink to support logistics data pipelines. I'm moving from Kinesis sources/sinks to Kafka sources/sinks.
However, in the Flink dashboard, the job metrics are not being updated in near-real-time. Do you know what could be wrong with the job or version?
By the way, once the job is closed, the dashboard shows all the metrics, just not in near-real-time.
Job non-updating metrics picture
Fixed after cleaning up conflicting dependencies on the "kafka-clients" library.
In my case, I was also using some Avro and CloudEvents libraries that pull in a higher kafka-clients version. I just needed to exclude kafka-clients from these libraries so that the kafka-clients version that comes with Flink is used. This solved the issue.
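A conflict like this can be spotted by looking for the same artifact at more than one version among the jars that end up on the job's classpath. Below is a rough Python sketch under stated assumptions: the jar file names are made up for illustration and are assumed to follow the usual artifact-version.jar naming. It only reports duplicates; it does not perform the exclusion described above.

```python
import re
from collections import defaultdict

# Assumes the common "<artifact>-<version>.jar" naming,
# where the version part starts with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_version_conflicts(jar_names):
    """Map each artifact that appears with more than one version to its versions."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_NAME.match(name)
        if match:  # names that do not fit the pattern are skipped
            versions[match["artifact"]].add(match["version"])
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}


# Hypothetical classpath listing, for illustration only.
jars = [
    "flink-connector-kafka_2.12-1.13.2.jar",
    "kafka-clients-2.4.1.jar",
    "kafka-clients-3.0.0.jar",
    "avro-1.10.2.jar",
]
print(find_version_conflicts(jars))
```

Any artifact that shows up in the result is a candidate for an exclusion in the build file of the library that drags in the unwanted version.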

Is there any way to upload third party dependencies and files in Flink jobs?

In Spark, instead of building a fat-jar, we are able to upload third-party dependencies required for my job via --jars.
Similarly, I am able to upload the configuration files needed for my job via --files.
But in Flink, I checked --help and didn't find a similar option.
So:
For dependent jars, is there a way to load third-party dependencies for jobs on demand without modifying the cluster environment? (The required third-party dependencies may be different for different jobs.)
For the configured files, is there a way to upload them like --files? And how do I read it?
Thanks.
Additional:
When I use flink run to submit a job in YARN per-job mode, I can read my local file in Flink.
But when using flink run-application, I tried a variety of ways and was unable to read the local file. Is there any way to do it?
I am using yarn. Flink version: 1.14

How does a PyFlink job call external jar?

I want to call my Java interfaces in a jar file from a PyFlink job. I found no solution in the official documentation.
It looks to me like support for this was not included in Flink 1.9, but is ongoing work. See FLIP-58. FLIP-78 and FLIP-88 may also be of interest. Note that most of these improvements will be included in the upcoming Flink 1.10 release.
You can use the Python Table API to register a Java user-defined function, if that satisfies your need. The method is register_java_function on the table environment.

How to integrate non-Confluent connectors with Apache Kafka Connect

There is a requirement where we get a stream of data from Kafka Stream and our objective is to push this data to SOLR.
We did some reading and found that there are a lot of Kafka Connect solutions available, but we do not know which is the best one or how to set it up.
The options are:
Use Solr connector to connect with Kafka.
Use Apache Storm as it directly provides support for integrating with Solr.
There is not much documentation or in-depth information available for the options mentioned above.
Would anyone be kind enough to let me know:
How we can use a Solr connector and integrate with Kafka stream without using Confluent?
Solr-Kafka Connector: https://github.com/MSurendra/kafka-connect-solr
Also, with regard to Apache Storm,
will it be possible for Apache Storm to accept the Kafka Stream and push it to Solr, given that we would need some sanitization of data before pushing it to Solr?
I am avoiding Storm here, because the question is mostly about Kafka Connect
CAVEAT - The Solr connector in the question uses Kafka 0.9.0.1 dependencies, so it is very unlikely to work with the newest Kafka APIs.
This connector is untested by me. Follow at your own risk.
The following is an excerpt from Confluent's documentation on using community connectors, with some emphasis and adaptations. In other words, it is written for Kafka Connect connectors that are not included in Confluent Platform.
1) Clone the GitHub repo for the connector
$ git clone https://github.com/MSurendra/kafka-connect-solr
2) Build the jar with maven
Change into the newly cloned repo, and check out the version you want. (This Solr connector has no releases like the Confluent ones.)
You will typically want to check out a released version.
$ cd kafka-connect-solr; mvn package
From here, see Installing Plugins
3) Locate the connector’s uber JAR or plugin directory
We copy the resulting Maven output in the target directory into one of the directories on the Kafka Connect worker’s plugin path (the plugin.path property).
For example, if the plugin path includes the /usr/local/share/kafka/plugins directory, we can use one of the following techniques to make the connector available as a plugin.
As mentioned in the Confluent docs, the export CLASSPATH=<some path>/kafka-connect-solr-1.0.jar option would work, though plugin.path will be the way moving forward (Kafka 1.0+)
You should know which option to use based on the result of mvn package
Option 1) A single, uber JAR file
With this Solr Connector, we get a single file named kafka-connect-solr-1.0.jar.
We copy that file into the /usr/local/share/kafka/plugins directory:
$ cp target/kafka-connect-solr-1.0.jar /usr/local/share/kafka/plugins/
Option 2) A directory of dependencies
(This does not apply to the Solr Connector)
If the connector’s JARs are collected into a subdirectory of the build’s target directories, we can copy all of these JARs into a plugin directory within the /usr/local/share/kafka/plugins, for example
$ mkdir -p /usr/local/share/kafka/plugins/kafka-connect-solr
$ cp target/kafka-connect-solr-1.0.0/share/java/kafka-connect-solr/* /usr/local/share/kafka/plugins/kafka-connect-solr/
Note
Be sure to install the plugin on all of the machines where you’re running Kafka Connect distributed worker processes. It is important that every connector you use is available on all workers, since Kafka Connect will distribute the connector tasks to any of the workers.
4) Running Kafka Connect
If you have properly set plugin.path or exported CLASSPATH, then you can use connect-standalone or connect-distributed with the appropriate config file for that Connect project.
Regarding,
we would need some sanitization of data before pushing it to Solr
You would need to do that with a separate process like Kafka Streams, Storm, or other process prior to Kafka Connect. Write your transformed output to a secondary topic. Or write your own Kafka Connect Transform process. Kafka Connect has very limited transformations out of the box.
Also worth mentioning - JSON seems to be the only supported Kafka message format for this Solr connector
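The sanitization step mentioned above could look roughly like the following when done in a separate process before Kafka Connect. This is only a sketch: the field names (id, title, note), the cleaning rules, and the function name are hypothetical, and the actual reading from the source topic and writing to the secondary topic are left out. It takes one JSON message and returns a cleaned JSON message, since JSON appears to be the format this connector expects.

```python
import json


def sanitize_record(raw_message):
    """Clean one JSON message; return None if it should be dropped.

    Hypothetical rules: require a non-empty "id", strip whitespace from
    string values, and drop fields whose value is null or empty.
    """
    try:
        record = json.loads(raw_message)
    except json.JSONDecodeError:
        return None  # not valid JSON, skip it
    if not isinstance(record, dict):
        return None  # only JSON objects are treated as documents

    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value is None or value == "":
            continue
        cleaned[key] = value

    if "id" not in cleaned:
        return None
    return json.dumps(cleaned, sort_keys=True)


print(sanitize_record('{"id": " 42 ", "title": "  Hello ", "note": null}'))
# prints {"id": "42", "title": "Hello"}
```

Messages for which the function returns None would simply not be written to the secondary topic that the Solr connector reads from.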

How to install Flink on Mesos cluster without DC/OS?

I am a newbie in Apache Flink and our team is trying to set up an Apache Flink cluster on Apache Mesos. We have already installed Apache Mesos & Marathon with 3 master nodes and 3 slaves, and now we are trying to install Apache Flink without DC/OS as mentioned here https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/mesos.html#mesos-without-dcos.
I have a couple of questions:
Do we need to download Flink on all the nodes(master and slaves) and configure mesos.master in all nodes?
Or Shall we download flink on only one master node and configure mesos.master over there?
If flink needs to be downloaded on all the nodes then what should be the location of flink directory or if there is any script where I can specify that?
Is running "mesos-appmaster.sh" on master node also responsible for running flink libraries and classes on slaves?
Thanks
Do we need to download Flink on all the nodes(master and slaves) and configure mesos.master in all nodes?
No, you don't. Actually, it depends on how you want to run Flink. In your setup, the most convenient way would be to run it with Marathon and download the binaries during deployment. See this
Or Shall we download flink on only one master node and configure mesos.master over there?
It's up to you. You can run Flink on a dedicated server or let Marathon do it for you. If you already have Marathon, then it's easier to run Flink with Marathon. On the other hand, for debugging and proof-of-concept work I'd recommend the standalone version, where you can quickly change the configuration on a local machine and see how it works. Creating Docker images or binaries, publishing them in a repository, and finally deploying Flink on Marathon adds overhead that will slow you down in development but will keep you safe in production. Flink does not come with support for High Availability (HA), so Marathon is required to provide basic HA support (launching a new instance of Flink when an agent crashes).
If flink needs to be downloaded on all the nodes then what should be the location of flink directory or if there is any script where I can specify that?
Flink does not have to be downloaded on all nodes. It can be downloaded when needed at deployment.
Is running "mesos-appmaster.sh" on master node also responsible for running flink libraries and classes on slaves?
Flink is a scheduler which means that it should start tasks and executors on Mesos when needed.
Even when not using DC/OS, feel free to look at the Apache Flink DC/OS package. At its core, it is a marathon app definition you can deploy on pure Marathon/Mesos. The Flink package (as of today) does not require any DC/OS specific features.
The DC/OS example might also provide useful information.
