Custom configuration file flink-conf.yaml - apache-flink

I need to specify different Flink settings for different applications. In other words, each application should run with its custom file flink-conf.yaml. What is the proper way to do it?
I found some old recommendations to set FLINK_CONF_DIR to a custom directory containing the Flink configuration files (for example: How could I override configuration value in Apache Flink?). However, the official Flink documentation does not mention the FLINK_CONF_DIR variable at all (as of Flink 1.13). I therefore doubt that this approach is officially recommended and supported by the Flink developers.
UPDATE 1: Details on application running
I am running Flink on YARN in the Application mode. Here is how I launch the application:
"$flink_home/bin/flink" \
run-application \
--target yarn-application \
--class com.example.App1
The out-of-the-box Flink configuration is located in the $flink_home/conf directory. As I have several applications App1, App2, ..., I want them to use their respective Flink configurations instead of the out-of-the-box configuration.

TL;DR: The paragraph about FLINK_CONF_DIR was accidentally removed when the Flink on YARN docs were rewritten for the Flink 1.12 release. It is still the intended and supported way to establish per-application settings in YARN clusters.
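A minimal sketch of the FLINK_CONF_DIR approach applied to the launch command from the question. The directory layout (one subdirectory per application under a common root), the helper names conf_dir_for_app and launch_app, and the jar argument are assumptions for illustration. Each per-application directory is assumed to be a full copy of $flink_home/conf with an adjusted flink-conf.yaml:

```shell
# Print the configuration directory for a given application name.
# Assumed (hypothetical) layout: <conf_root>/<app_name>/flink-conf.yaml
conf_dir_for_app() {
  local conf_root="$1" app_name="$2"
  printf '%s/%s\n' "$conf_root" "$app_name"
}

# Launch one application in YARN application mode with its own FLINK_CONF_DIR.
# Arguments: Flink home, configuration root, application name, main class, jar.
launch_app() {
  local flink_home="$1" conf_root="$2" app_name="$3" main_class="$4" jar="$5"
  local conf_dir
  conf_dir="$(conf_dir_for_app "$conf_root" "$app_name")"
  if [ ! -f "$conf_dir/flink-conf.yaml" ]; then
    echo "missing $conf_dir/flink-conf.yaml" >&2
    return 1
  fi
  # The variable is set for this one command only, so other launches are unaffected.
  FLINK_CONF_DIR="$conf_dir" "$flink_home/bin/flink" \
    run-application \
    --target yarn-application \
    --class "$main_class" \
    "$jar"
}
```

For example, launch_app "$flink_home" /etc/flink-apps app1 com.example.App1 /path/to/app1.jar would run App1 with /etc/flink-apps/app1 as its configuration directory, and a second call with app2 would pick up its own directory.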
Other ways to override the configuration:
You can override the settings specified in the cluster's flink-conf.yaml file with settings you specify on the command line, as described in this answer.
You can also override specific settings from the global configuration in your code, e.g.:
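import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration conf = new Configuration();
conf.setString("state.backend", "filesystem");
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);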
You can also load all of the settings in a flink-conf.yaml file from your application code, via
FileSystem.initialize(GlobalConfiguration.loadConfiguration("/path/to/conf/directory"));
And with Kubernetes you can mount different ConfigMaps for different applications.

Related

Flink logging - Using Log4j2

We are running a Flink (1.9.1) application on AWS EMR (5.29) using YARN. We use a common logging adaptor throughout all the components of our project (including the Flink application), and it uses Log4j2.
From the documentation, I see that there are 3 configuration files.
log4j.properties
log4j-yarn-session.properties
log4j-cli.properties
I understand that I will have to modify log4j.properties for the job manager and task manager logs and log4j-cli.properties for the code not included in the cluster code.
Now given this situation,
How do I pass my log4j2.properties?
Do we replace the logging jars in the lib folder with log4j2 jars?
This is a workaround rather than a solid solution: if the log4j.properties file in the /conf folder is deleted, the log4j2 properties file found in a jar on the classpath is used instead. Be careful if several jars on the classpath contain a log4j2 properties file.

Apache Flink Dynamically setting JVM_OPT env.java.opts

Is it possible to set the custom JVM Options env.java.opts when submitting a job without specifying it in the conf/flink-conf.yaml file?
The reason I am asking is I want to use some custom variables in my log4j. I am also running my job on YARN.
I have tried the following command using the CLI and it strips everything off from the = sign onwards
$ flink run -m yarn-cluster -yn 2 -yst -yD env.java.opts="-DappName=myapp -DcId=mycId"
At the moment this is not possible due to the way Flink parses the dynamic properties. Flink assumes that dynamic properties have the form -D<KEY>=<VALUE> and that <VALUE> does not contain any = which is clearly wrong. Thus, for the moment you have to specify the env.java.opts via flink-conf.yaml.
I've opened a JIRA issue to fix this problem.
Update
The problem has been fixed for Flink >= 1.3.0 and >= 1.2.2.
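To illustrate the parsing rule discussed above: a dynamic property has to be split at the first = only, so that the value may itself contain = characters. The sketch below is not Flink's implementation. The function name split_dynamic_property is hypothetical, and it only demonstrates the splitting rule with shell parameter expansion:

```shell
# Split KEY=VALUE at the first "=" only; everything after it belongs to the value.
# Prints the key on the first line and the value on the second.
# Input without any "=" is not handled in this sketch.
split_dynamic_property() {
  local prop="$1"
  local key="${prop%%=*}"    # drop the longest suffix starting at "=": text before the first "="
  local value="${prop#*=}"   # drop the shortest prefix ending at "=": text after the first "="
  printf '%s\n%s\n' "$key" "$value"
}

split_dynamic_property 'env.java.opts=-DappName=myapp -DcId=mycId'
```

With the property from the question, the key is env.java.opts and the value keeps both -D options, including their = signs.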
A simple solution which I tried was passing the configuration parameters in application.properties as arguments like below,
~/flink/bin/flink run app.jar --Brokers=Broker1:9093 --TopicName=some-topic
You can also pass in the parameters as a properties file,
~/flink/bin/flink run app.jar -Dspring.config.name=<full-path>/application.properties

How to use CHE_EXTRA_VOLUME_MOUNT?

Use Case
The code that I wish to edit in che is downloaded from a private SVN repository and uses a private nexus repository for maven dependencies. Due to this I need to use my custom settings.xml from "C:\Users\.m2".
It would be good to use the local maven repository too, hence the approach of creating a custom dockerfile that adds settings.xml was not used.
Setup
I created a user environment variable "CHE_EXTRA_VOLUME_MOUNT" with the value "~/.m2:/home/user/.m2".
I can see the env variable from "Docker Quickstart Terminal".
Environment
OS: Windows 7
Docker version: 1.12.6, build 78d1802
Docker image: eclipse/che-server:5.0.0
Problem
Can't see the mount path "/home/user/.m2" in any workspace.
Can someone please help me with this use case?
I see a couple issues. First, in the che.env file, you should be modifying CHE_WORKSPACE_VOLUME. The CHE_EXTRA_VOLUME_MOUNT is an older name that applied to the 4.x releases.
Second, the mount path you are using is likely not going to work well on Windows 7. That system runs Docker through Boot2Docker, and VirtualBox limits the files that can be mounted to those that exist in a subfolder of %userprofile%.
So:
1. First make sure that c:\Users\.m2 is part of this subfolder, and then:
2. Use the absolute path to your .m2 folder in the mount in the che.env:
CHE_WORKSPACE_VOLUME=/C/Users/<user_name>/.m2:/home/user/.m2
This funky path naming for volume mounts is a limitation in how the Docker client can understand volume mounts if you are using it on the batch shell.
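The path conversion in step 2 can be sketched as a small shell function. The function name to_docker_path and the user name "me" are hypothetical. The function keeps the drive letter as written, matching the /C/... form above:

```shell
# Convert a Windows path such as C:\Users\me\.m2 to the /C/Users/me/.m2 form.
to_docker_path() {
  local win_path="$1"
  local drive="${win_path%%:*}"   # text before the first ":" (the drive letter)
  local rest="${win_path#*:}"     # text after the first ":" (still with backslashes)
  # Replace every backslash with a forward slash.
  rest="$(printf '%s' "$rest" | tr '\\' '/')"
  printf '/%s%s\n' "$drive" "$rest"
}

# Build the che.env line for a hypothetical user "me".
echo "CHE_WORKSPACE_VOLUME=$(to_docker_path 'C:\Users\me\.m2'):/home/user/.m2"
```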
A matching answer is posted on Che's support site - https://github.com/eclipse/che/issues/3888
Looks like it is a bug in eclipse che. You can create an issue at https://github.com/eclipse/che/issues

How to set up subgit to mirror an svn repo that looks like a Windows Explorer hierarchy?

Being windows users, we created one svn repo with a hierarchy of folders. The bottom nodes contain the svn standard layout:
ProjectA/
ApplicationOne/
ModuleX/
trunk/
branches/
tags/
ApplicationTwo/
ModuleY/
trunk/
branches/
tags/
... and so on ad infinitum. The repo now contains more than 100 real svn projects with the trunk/branches/tags structure, but almost none of them at the top level.
How would I configure subgit to handle this?
SubGit can work in two different modes: local mirror mode and remote mirror mode. Below you can find a general overview of these modes and some recommendations for your particular case.
Local Mirror Mode
In this mode both Subversion and Git repositories reside on the same host, so SubGit has local access to both SVN and Git sides.
Below I've provided basic instructions. Please find detailed documentation and common pitfalls in SubGit 'Local Mode' Book.
Configuration
subgit configure <SVN_REPO>
SubGit version <VERSION> build #<BUILD_NUMBER>
Detecting paths eligible for translation... done.
Subversion to Git mapping has been configured:
/ProjectA/ApplicationOne/ModuleX : <SVN_REPO>/git/ProjectA/ApplicationOne/ModuleX.git
/ProjectA/ApplicationTwo/ModuleY : <SVN_REPO>/git/ProjectA/ApplicationTwo/ModuleY.git
...
CONFIGURATION SUCCESSFUL
...
This command tries to auto-detect the repository layout and generates a configuration file at <SVN_REPO>/conf/subgit.conf. It may take a while for a big Subversion repository like yours.
Please make sure that the auto-generated configuration file looks as follows, and adjust it if necessary:
...
[git "ProjectA/ApplicationOne/ModuleX"]
translationRoot = /ProjectA/ApplicationOne/ModuleX
repository = git/ProjectA/ApplicationOne/ModuleX.git
pathEncoding = UTF-8
trunk = trunk:refs/heads/master
branches = branches/*:refs/heads/*
shelves = shelves/*:refs/shelves/*
tags = tags/*:refs/tags/*
...
Authors mapping
At this stage you have to create /conf/authors.txt file that maps existing SVN usernames to Git authors. Please refer to documentation for more details.
Installation
Finally you have to import your Subversion repository to Git and enable synchronization by running subgit install command:
subgit install repo
SubGit version <VERSION> build #<BUILD_NUMBER>
Subversion to Git mapping has been found:
/ProjectA/ApplicationOne/ModuleX : <SVN_REPO>/git/ProjectA/ApplicationOne/ModuleX.git
/ProjectA/ApplicationTwo/ModuleY : <SVN_REPO>/git/ProjectA/ApplicationTwo/ModuleY.git
...
Processing '/ProjectA/ApplicationOne/ModuleX'
Translating Subversion revisions to Git commits...
Processing '/ProjectA/ApplicationTwo/ModuleY'
Translating Subversion revisions to Git commits...
...
Subversion revisions translated: <REVISIONS_NUMBER>.
Total time: <TIME_SPENT> seconds.
INSTALLATION SUCCESSFUL
Git Server
When the installation is over and synchronization between the Subversion and Git repositories is enabled, you can set up a Git server (or reuse an existing Apache HTTP server). Please refer to the documentation on that and see a couple of posts on this topic in our blog:
VisualSVN Server and SubGit
Gitolite and SubGit
Remote Mirror Mode
In this mode SubGit is installed into the Git repository only, and that repository is kept synchronized with a remote Subversion server hosted on a different machine.
Below you can find some basic instructions. Please refer to SubGit 'Remote Mode' Book for more details.
Configuration
In remote mirror mode SubGit does not try to auto-detect the repository layout, so you have to run the subgit configure --svn-url <SVN_URL> command for every module within the Subversion repository:
subgit configure --svn-url <SVN_ROOT_URL>/ProjectA/ApplicationOne/ModuleX <GIT_ROOT_DIR>/ProjectA/ApplicationOne/ModuleX.git
SubGit version <VERSION> build #<BUILD_NUMBER>
Configuring writable Git mirror of remote Subversion repository:
Subversion repository URL : <SVN_ROOT_URL>/ProjectA/ApplicationOne/ModuleX
Git repository location : <GIT_ROOT_DIR>/ProjectA/ApplicationOne/ModuleX.git
CONFIGURATION SUCCESSFUL
...
As a result, SubGit generates a configuration file <GIT_REPO>/subgit/config for every Git repository. For your case this configuration file should look as follows:
...
[svn]
url = <SVN_ROOT_URL>/ProjectA/ApplicationOne/ModuleX
trunk = trunk:refs/heads/master
branches = branches/*:refs/heads/*
tags = tags/*:refs/tags/*
shelves = shelves/*:refs/shelves/*
fetchInterval = 60
connectTimeout = 30
readTimeout = 60
keepGitCommitTime = false
auth = default
[auth "default"]
passwords = subgit/passwd
useDefaultSubversionConfigurationDirectory = false
subversionConfigurationDirectory = <SVN_CONFIG_DIR>
...
Authors mapping
At this stage you have to create /subgit/authors.txt file that maps existing SVN usernames to Git authors. Please refer to documentation for more details.
SVN credentials
If you're not using the file:// protocol, you have to provide the necessary credentials so that SubGit is able to authenticate against the Subversion server. For more information on that, please read the corresponding chapter in SubGit Book.
We also recommend enabling the pre-revprop-change hook on the Subversion side, which makes further installation and maintenance a bit easier; see SubGit Book.
Installation
Finally you have to import your Subversion repository to Git and enable synchronization by running subgit install command:
subgit install git
SubGit version <VERSION> build #<BUILD_NUMBER>
Translating Subversion revisions to Git commits...
Subversion revisions translated: <REVISIONS_NUMBER>.
Total time: <TIME_SPENT> seconds.
INSTALLATION SUCCESSFUL
This command also launches a background process that polls the SVN server and fetches new revisions when they appear there. In other words, SubGit uses a dedicated process for every Git repository. Sometimes it makes sense to avoid running such processes and use a job scheduler instead.
Git server
Those links I've provided above are relevant for remote mode as well.
However, if you're going to use Atlassian Stash for Git hosting, you can use SVN Mirror Plugin which is based on SubGit engine and provides some better experience with regards to UI and maintenance.
We have the following guideline which is based on our experience:
If you have many independent Subversion repositories, it's better to use SubGit in local mirror mode, as it doesn't require SVN polling or maintaining additional processes for that.
If you have one giant Subversion repository with many modules, it's better to use remote mirror mode with the file:// protocol and adjust the basic setup slightly.
It definitely doesn't make sense to run 100+ background processes in your case. Instead we recommend installing an additional post-commit SVN hook that checks which modules were modified by a given revision and then triggers synchronization for the corresponding Git repositories.
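A rough sketch of the first half of such a hook: working out which modules a revision touched. It assumes the layout from the question (each module root directly contains trunk, branches, and tags) and the usual svnlook changed output, where the path starts in column 5. The function names changed_modules and post_commit_sync are illustrative, and the command that actually triggers synchronization is deliberately left as a placeholder:

```shell
# Read "svnlook changed" output on stdin and print each affected module root once.
# The module root is everything before the first trunk/branches/tags path component.
changed_modules() {
  cut -c5- | awk -F/ '
    {
      root = ""
      for (i = 1; i <= NF; i++) {
        if ($i == "trunk" || $i == "branches" || $i == "tags") {
          if (root != "" && !(root in seen)) { seen[root] = 1; print root }
          break
        }
        root = (root == "" ? $i : root "/" $i)
      }
    }'
}

# Body of a post-commit hook; Subversion passes the repository path and the revision.
# The third argument (root directory of the Git repositories) is an assumption.
post_commit_sync() {
  local repos="$1" rev="$2" git_root="$3"
  svnlook changed -r "$rev" "$repos" | changed_modules | while read -r module; do
    # Placeholder: trigger synchronization of this Git repository here.
    echo "needs sync: $git_root/$module.git"
  done
}
```

Paths that do not contain a trunk, branches, or tags component (for example a newly created empty folder) produce no output, so only real modules are synchronized.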
If you have any other questions, feel free to ask us here at Stack Overflow, at our issue tracker or contact us via email: support@subgit.com.

Running Solr with Jetty

I'm having a little trouble understanding how Solr fits in with Jetty, and why I can't seem to get the start.jar in the distribution package to work.
I can run all of the example configurations via java -jar start.jar. However, when I try to run something like the following --
java -Dsolr.solr.home=/Users/jwwest/solr -jar $(brew --prefix solr)/libexec/example/start.jar
-- the following error occurs:
java.io.FileNotFoundException: No XML configuration files specified in start.config or command line.
at org.eclipse.jetty.start.Main.start(Main.java:506)
at org.eclipse.jetty.start.Main.main(Main.java:95)
I opened up the start.jar file, and there is a start.config file located inside of the jar which I'm assuming should handle this configuration for me. I'm not understanding why it will work when run from inside of the distribution examples directory, but not outside of it.
You also need to define the jetty.home property. Try:
java -Dsolr.solr.home=/Users/jwwest/solr -jar $(brew --prefix solr)/libexec/example/start.jar -Djetty.home=$(brew --prefix solr)/libexec/example
You can see the effective command line start.jar generates by using the --dry-run command line flag.
java -jar start.jar --dry-run
That will output everything with full path names so you can run it from outside the directory.
Source: http://www.eclipse.org/jetty/documentation/9.0.0.M3/advanced-jetty-start.html
The start.jar is a jetty specific mechanism that works to build out all the classpath requirements for starting up Jetty. It is generally only used in the scope of the jetty distribution. Pulling the start.jar out of the configuration and placing it somewhere else renders the default configuration of the start.config rather moot.
My understanding of Solr is that it bundles itself with a distribution of jetty, placing what it needs to run into the distribution and repackages it as its own. They may have a custom start.config file that further adds its own locations for classpath resources and the like, or not.
The exception you are seeing stems from the start.config file expecting an etc/ directory containing jetty.xml formatted xml files which are used to configure the jetty process.
Jetty being often used in an embedded format has little to do with this issue, it is simply a common use case because jetty is incredibly easy to embed into an application. Embedded instances of jetty rarely (if ever) leverage a start.jar...instead it is up to the embedding application to manage its own classpath.
First, change to the directory where start.jar is located, then execute the same command.
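A sketch of that suggestion as a shell function, so that the calling shell's working directory is left unchanged. The function name start_solr is hypothetical; the paths in the usage comment are the ones from the question:

```shell
# Run start.jar from inside its own directory so that the relative paths it
# expects (such as etc/) resolve; the subshell keeps the caller's directory unchanged.
start_solr() {
  local example_dir="$1" solr_home="$2"
  (
    cd "$example_dir" || exit 1
    java -Dsolr.solr.home="$solr_home" -jar start.jar
  )
}

# Usage, with the paths from the question:
#   start_solr "$(brew --prefix solr)/libexec/example" /Users/jwwest/solr
```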
Jetty is often used as an embedded container. If you want to use Jetty, a good start would be to copy the example directory and rename it to what you want it to be. The solr directory is the one for basic configuration.
Otherwise, it is recommended to use Tomcat and the solr.war file.
