Adding Hadoop dependencies to standalone Flink cluster - apache-flink

I want to create an Apache Flink standalone cluster with several TaskManagers. I would like to use HDFS and Hive, so I have to add some Hadoop dependencies.
After reading the documentation, the recommended way is to set the HADOOP_CLASSPATH environment variable. But how do I add the Hadoop files? Should I download them to a directory like /opt/hadoop on the TaskManagers and point the variable to that path?
I only know the old, now deprecated way of downloading an uber JAR with the dependencies and placing it in the /lib folder.

Normally you'd do the standard Hadoop installation, since for HDFS you need DataNodes running on every server (with the appropriate configuration), plus the NameNode running on your master server.
Then you can do something like this on the master server where you're submitting your Flink workflow:
export HADOOP_CLASSPATH=`hadoop classpath`
export HADOOP_CONF_DIR=/etc/hadoop/conf
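The same environment has to be visible to every TaskManager process as well, not only the master where you submit. A minimal sketch, assuming a Hadoop installation under /opt/hadoop on each node (as in your question), would be:
# on every JobManager and TaskManager node, e.g. in the Flink user's shell profile
export HADOOP_HOME=/opt/hadoop
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
export HADOOP_CONF_DIR=/etc/hadoop/conf

# then (re)start the standalone cluster so the Flink processes pick it up
./bin/start-cluster.sh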

Related

Flink run job with remote jar file

I'm new to Flink and trying to submit my Flink program to my Flink cluster.
I have a Flink cluster running on remote Kubernetes and blob storage on Azure.
I know how to submit a Flink job when I have the jar file on my local machine, but I have no idea how to submit the job with a remote jar file (the jar can be accessed via HTTPS).
I checked the documentation and it doesn't seem to provide something like what we do in Spark.
Thanks in advance.
I think you can use an init container to download the job jar into a shared volume, then submit the local jar to Flink.
Also: Google's Flink Operator supports remote job jars, see this example.
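For the init-container approach, a rough sketch of the two steps (the URL, paths, and JobManager address are placeholders) could look like:
# init container: fetch the remote jar into a volume shared with the submit step
wget -O /flink-jars/my-job.jar "https://<your-blob-storage>/path/to/my-job.jar"

# submit step: run the now-local jar against the cluster
./bin/flink run -m <jobmanager-host>:8081 /flink-jars/my-job.jar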

How to execute a Flink job remotely when the job jar is bulky

I have a Flink server running on a Kubernetes cluster. I have a job jar which is bulky due to product and third-party dependencies.
I run it via
ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(host, port, jar);
The jar size is around 130 MB after optimization.
I want to invoke the remote execution without the jar upload, so that the upload does not happen every time the job needs to be executed. Is there a way to upload the jar once and then call it remotely without mentioning the jar (in Java)?
You could deploy a per-job cluster on Kubernetes. This will submit your user code jar along with the Flink binaries to your Kubernetes cluster. The downside is that you cannot change the job afterwards without restarting the Flink cluster.
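On recent Flink versions, Kubernetes application mode is one way to get this per-job style of deployment. A sketch, assuming the job jar is baked into a custom image (the cluster id and image name are placeholders):
./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-flink-job \
    -Dkubernetes.container.image=my-registry/my-flink-job:latest \
    local:///opt/flink/usrlib/my-job.jar
Since the 130 MB jar ships inside the image, nothing has to be uploaded at submission time.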

Where can I find my jar on Apache Flink server which I submitted using Apache Flink dashboard

I developed a Flink job and submitted it using the Apache Flink dashboard. My understanding is that when I submit the job, my jar should be available on the Flink server. I tried to figure out the path of my jar but wasn't able to. Does Flink keep these jar files on the server? If yes, where can I find them? Is there any documentation? Please help. Thanks!
JAR files are renamed when they are uploaded and stored in a directory that can be configured with the web.upload.dir configuration key.
If the web.upload.dir parameter is not set, the JAR files are stored in a dynamically generated directory under the jobmanager.web.tmpdir (default is System.getProperty("java.io.tmpdir")).
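If neither option was configured explicitly, a quick way to look for the uploaded files on the JobManager host is something like the following (assuming the default Java temp directory; the jars normally land in a flink-web-* directory):
find /tmp -type f -name "*.jar" -path "*flink-web*" 2>/dev/null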

How to copy data between multiple Jenkins master and slave setup

I have two Jenkins masters, namely A and B. I am wondering how a slave of master A could copy data from master B. Is there any plugin available to do this kind of job?
There are a few plugins that can help:
Publish via SSH
Publish to an FTP server
Publish to a Windows file share
You can also try this Python script to download the last successful artifacts from Jenkins via the REST API. We use it in production and it works very well.
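If a plain script is enough, the same REST API can also be driven with curl; the last successful build's archived artifacts are available as a single zip (the job name, credentials, and host are placeholders):
# download all archived artifacts of the last successful build as one zip
curl -fL -u "$JENKINS_USER:$JENKINS_API_TOKEN" -o artifacts.zip \
    "https://master-b.example.com/job/my-job/lastSuccessfulBuild/artifact/*zip*/archive.zip"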

How to transfer an SSIS package from Dev to Prod?

I'm trying to move my packages to production using a configuration file, but only part of the configuration takes effect and the results still go to the DEV server.
Does anybody know what to do?
It is difficult to isolate the cause of your issues without access to your configuration files.
What I suggest you do is make use of package configurations that reference a database within your environment. The databases themselves can then be referenced using environment variables that are unique to each environment.
This is a brilliant time saver and a good way to centrally manage the configuration of all your SSIS packages. Take a look at the following reference for details.
http://www.mssqltips.com/tip.asp?tip=1405
Once configured, you can deploy the same identical package between dev and production without needing to apply a single modification to the SSIS package or mess around with configuration files.
You could still have hard-coded connections in your package even though you are using a configuration file. You'll need to check every connection as well.
You can also go the long way around. Go into Integration Services and export the stored package to its .dtsx file. Then you can open the file in any good text editor, do a find/replace on your server name, and then go back into Integration Services and import the updated package. A lot of times it's just easier...
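If you prefer the command line to a text editor for that find/replace, something like this works, assuming the server name appears as plain text in the exported .dtsx XML (both names are placeholders):
# swap the dev server name for the production one in the exported package XML
sed -i 's/DEV-SQL-SERVER/PROD-SQL-SERVER/g' MyPackage.dtsx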
Thanks everybody for answering. I managed to solve this problem in an ugly way (editing the packages on the server), but I'd much prefer a more elegant solution. Now I'm trying the environment variable approach; it seems great, but the wizard I get is different from the one shown in the link and I don't know how to continue (I'm using Visual Studio 2005). I also tried the configuration file as XML, but then the package run fails even on the source machine, so I'm stuck!
My personal technique has been to first have a single config file that points the package to a SQL-based package configuration (the connection string to the config DB). Subsequent entries in the package configuration use the SQL store to load their settings. I have a script that goes into the XML of the packages and preps them for deployment to stage or prod. A config file holds the name of the package configuration's initial file config entry and where the stage and prod configuration DB config file is located. The script produces two subdirectories for stage and prod, and each directory has a copy of the solution's packages modified for its particular deployment.
Also! Don't forget to turn off encryption in the package files!
