Using another FileSystem configuration while creating a job - apache-flink

Summary
We are currently facing an issue with the FileSystem abstraction in Flink. We have a job that can dynamically connect to an S3 source (meaning it's defined at runtime).
We discovered a bug in our code, and it could be due to a wrong assumption about the way the FileSystem works.
Bug explanation
During the initialization of the job (so in the job manager), we use the FS to check that some files exist, in order to fail gracefully before the job is executed.
In our case, we need to configure the FS dynamically. It can be HDFS, S3 on AWS, or S3 on MinIO.
We want the FS configuration to be specific for the job, and different from the cluster one (different access key, different endpoint, etc.).
Here is an extract of the code we are using to do so:
private void validateFileSystemAccess(Configuration configuration) throws IOException {
    // Create a plugin manager from the configuration
    PluginManager pluginManager = PluginUtils.createPluginManagerFromRootFolder(configuration);
    // Init the FileSystem from the configuration
    FileSystem.initialize(configuration, pluginManager);
    // Validate the FileSystem: an exception is thrown if the FS configuration is wrong
    Path archiverPath = new Path(this.archiverPath);
    archiverPath.getFileSystem().exists(new Path("/"));
}
After starting that specific kind of job, we notice that:
checkpointing does not work for this job; it throws a credential error.
the job manager cannot upload the artifacts needed by the history server for any already-running job of any kind (not only this specific kind of job).
If we do not deploy that kind of job, the upload of artifacts and the checkpointing work as expected on the cluster.
We think this issue might come from FileSystem.initialize(), which overrides the configuration for all FileSystems. Because of this, the next call to FileSystem.get() would return the FileSystem we configured in validateFileSystemAccess instead of the cluster-configured one.
Questions
Could our hypothesis be correct? If so, how could we provide a specific configuration for the FileSystem without impacting the whole cluster?
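One option we are considering, if the hypothesis holds, is to skip FileSystem.initialize() for this validation step and open a dedicated Hadoop FileSystem instance with a job-specific configuration, so nothing global is touched. A minimal sketch, assuming the hadoop-aws (s3a) connector is on the classpath (the helper class, bucket URI, and property values below are illustrative, not our actual code):
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class S3AccessValidator {

    public static void validate(String bucketUri, String accessKey,
                                String secretKey, String endpoint) throws Exception {
        // Job-specific credentials and endpoint, kept out of the cluster-wide config
        Configuration hadoopConf = new Configuration();
        hadoopConf.set("fs.s3a.access.key", accessKey);
        hadoopConf.set("fs.s3a.secret.key", secretKey);
        hadoopConf.set("fs.s3a.endpoint", endpoint);

        // newInstance() bypasses Hadoop's FileSystem cache, so this instance
        // does not leak into other components of the cluster
        try (FileSystem fs = FileSystem.newInstance(URI.create(bucketUri), hadoopConf)) {
            if (!fs.exists(new Path("/"))) {
                throw new IllegalStateException("Bucket root not reachable: " + bucketUri);
            }
        }
    }
}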

Related

How to provide KafkaSource SSL files to Flink worker nodes

I am creating a Kafka-based Flink streaming application, and am trying to create an associated KafkaSource connector in order to read Kafka data.
For example:
final KafkaSource<String> source = KafkaSource.<String>builder()
        // standard source builder setters
        // ...
        .setProperty(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "truststore.jks")
        .build();
The truststore.jks file is created locally on the job manager node before the application is executed, and I've verified that it exists and is correctly populated. My problem is that, in a distributed Flink application, this truststore.jks does not automatically also exist on the task worker nodes, so the above code results in a FileNotFoundException when executed.
What I've tried:
Use env.registerCacheFile and getRuntimeContext().getDistributedCache().getFile() in order to distribute the file to all nodes, but since the graph is being built and the application is not yet running, the RuntimeContext is not available at this stage.
Supply a base64 parameter representation of the truststore, and manually convert it to .jks format. I'd need some sort of "pre-initialization" KafkaSource hook to do this, and haven't found any such functionality in the docs.
Use an external data store, such as s3, and retrieve the file from there. As far as I can tell, the internal Kafka consumer does not support non-local filesystems, so I'd still need some pre-initialization way to retrieve the file locally on each task node.
What is the best way to make this file available to task worker nodes during the source initialization?
I have read similar questions posted here before:
how to distribute files to worker nodes in apache flink
As explained above, I don't have access to the RuntimeContext at this point in the application.
Flink Kafka Connector SSL Support
This injects the truststore as a base64 encoded string parameter. I could do this, but since the internal Kafka consumer expects a file, I would have the problem of converting the parameter to .jks format before consumer initialization. I don't see a way of registering a "pre-initialization" hook for the KafkaSource in the docs.
Update:
I was able to work around this issue by instead using the ssl.truststore.certificates configuration field. This allows me to supply a base64-encoded representation of the underlying truststore.jks certificate instead of a local file path.
[I also had to update my kafka-clients dependency to 2.7.x+ as this configuration is not available in older versions of the library]
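For reference, a minimal sketch of that workaround, assuming kafka-clients 2.7+ and a PEM export of the CA certificate (the bootstrap servers, topic, and certificate string are placeholders):
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.kafka.common.config.SslConfigs;

public final class PemTruststoreExample {

    // Builds a KafkaSource whose trust material travels inline with the job
    // configuration, so no truststore.jks file needs to exist on task nodes.
    public static KafkaSource<String> buildSource(String bootstrapServers,
                                                  String topic,
                                                  String pemCertificates) {
        return KafkaSource.<String>builder()
                .setBootstrapServers(bootstrapServers)
                .setTopics(topic)
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .setProperty("security.protocol", "SSL")
                // PEM certificates passed as a string instead of a file path
                .setProperty(SslConfigs.SSL_TRUSTSTORE_TYPE_CONFIG, "PEM")
                .setProperty(SslConfigs.SSL_TRUSTSTORE_CERTIFICATES_CONFIG, pemCertificates)
                .build();
    }
}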

What are these autogenerated Db files?

I'm developing a grails app that uses reentrant locking and database locking. I just noticed some additional files that were autogenerated. I am wondering what they are. They are called:
devDb.h2.db
devDb.lock.db
devDb.trace.db
There is also a set for test configuration:
testDb.h2.db
testDb.trace.db
I am assuming that *Db.h2.db is just my database (set to be a file rather than in memory in my DataSource.groovy). But what about the other ones?
devDb.h2.db is the database itself (devDb.mv.db for newer versions of H2).
devDb.lock.db is a lock file. H2 allows multiple processes to share the database and this file is used to coordinate access. When the database shuts down cleanly, this file should be removed automatically.
devDb.trace.db is just a log for inspecting or debugging H2. Whether it is created, and how much detail gets logged, can be controlled by adding a TRACE_LEVEL_FILE parameter to the JDBC URL (see the example below).
testDb.h2.db and testDb.trace.db are the same as the devDb counterparts that are used in the test environment (e.g. grails test-app).
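For example, a minimal sketch of controlling the trace file via the JDBC URL (the database path and credentials are placeholders; TRACE_LEVEL_FILE accepts 0 = off, 1 = errors only, which is the default, 2 = info, 3 = debug):
import java.sql.Connection;
import java.sql.DriverManager;

public final class H2TraceLevelExample {
    public static void main(String[] args) throws Exception {
        // Appending TRACE_LEVEL_FILE to the URL limits what ends up in devDb.trace.db
        try (Connection conn = DriverManager.getConnection(
                "jdbc:h2:./devDb;TRACE_LEVEL_FILE=0", "sa", "")) {
            System.out.println("Connected with the trace file disabled");
        }
    }
}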

How to determine at runtime if I am connected to production database?

OK, so I did the dumb thing and released production code (C#, VS2010) that targeted our development database (SQL Server 2008 R2). Luckily we are not using the production database yet so I didn't have the pain of trying to recover and synchronize everything...
But, I want to prevent this from happening again when it could be much more painful. My idea is to add a table I can query at startup and determine what database I am connected to by the value returned. Production would return "PROD" and dev and test would return other values, for example.
If it makes any difference, the application talks to a WCF service to access the database so I have endpoints in the config file, not actual connection strings.
Does this make sense? How have others addressed this problem?
Thanks,
Dave
The easiest way to solve this is to not have access to production accounts. Those are stored in the Machine.config file for our .net applications. In non-.net applications this is easily duplicated, by having a config file in a common location, or (dare I say) a registry entry which holds the account information.
Most of our servers are accessed through aliases too, so no one really needs to change the connection string from environment to environment. Just grab the user from the config, and the server alias in the hosts file points you to the correct server. This also removes the headache of having to update all our config files when we switch db instances (change hardware, etc.).
So even with the ClickOnce deployment and the endpoints, you can publish a new endpoint URI in a machine config on the end user's desktop (I'm assuming this is an internal application), and then reference that in the code.
If you absolutely can't do this, as it might be a lot of work (the last place I worked had 2000 call center people, so this push was a lot more difficult, but still possible), you can always have an automated build server set up that modifies the app.config file as the last step of building the application. You then ALWAYS publish the compiled code from the automated build server. Never make the change to the app.config for something like this a manual step in the developer's process; this will always lead to problems at some point.
Now if none of this works, your final option (I've done this one too), which I hated but which worked, is to look up the value from a mapped drive. Essentially, everyone in the company has a drive mapped to, say, R:. This is where you have your production configuration files etc. The prod account people map to one drive location with the production values, and the devs etc. map to another with the development values. I hate this option compared to the others, but it works, and it can save you in a pinch when the others become tedious and difficult (due to, say, office politics, setting up a build server, etc.).
I'm assuming your production server has a different name than your development server, so you could simply SELECT @@SERVERNAME AS ServerName.
Not sure if this answer helps you in an assumed .NET environment, but within a *nix/PHP environment, this is how I handle the same situation.
OK, so I did the dumb thing and released production code
There are times when some app behavior is environment dependent, as you alluded to. In order to provide the ability to check between development and production environments, I added the following line to the global /etc/profile/profile.d/custom.sh config (CentOS):
SERVICE_ENV=dev
And in code I have a wrapper method that grabs an environment variable by name and localizes its value, making it accessible to my application code. Below is a snippet demonstrating how to check the current environment and react accordingly (in PHP):
public function __call($method, $params)
{
    // Reduce chatter on production envs
    // Only display debug messages if override told us to
    if (($method === 'debug') &&
        (CoreLib_Api_Environment_Package::getValue(CoreLib_Api_Environment::VAR_LABEL_SERVICE) === CoreLib_Api_Environment::PROD) &&
        (!in_array(CoreLib_Api_Log::DEBUG_ON_PROD_OVERRIDE, $params))) {
        return;
    }
}
Remember, you don't want to pepper your application logic with environment checks, save for a few extreme use cases as demonstrated in the snippet above. Rather, you should control access to your production databases using DNS. For example, within your development environment the db hostname mydatabase-db would resolve to a local server instead of your actual production server. When you push your code to the production environment, your DNS will correctly resolve the hostname, so your code should "just work" without any environment checks.
After hours of wading through textbooks and tutorials on MSBuild and app.config manipulation, I stumbled across something called SlowCheetah - XML Transforms http://visualstudiogallery.msdn.microsoft.com/69023d00-a4f9-4a34-a6cd-7e854ba318b5 that did what I needed it to do in less than an hour after first stumbling across it. Definitely recommended! From the article:
This package enables you to transform your app.config or any other XML file based on the build configuration. It also adds additional tooling to help you create XML transforms.
This package is created by Sayed Ibrahim Hashimi, Chuck England and Bill Heibert, the same Hashimi who authored THE book on MSBuild. If you're looking for a simple, ubiquitous way to transform your app.config, web.config or any other XML file based on the build configuration, look no further -- this VS package will do the job.
Yeah I know I answered my own question but I already gave points to the answer that eventually pointed me to the real answer. Now I need to go back and edit the question based on my new understanding of the problem...
Dave
I'm assuming your production server has a different IP address. You can simply use
SELECT CONNECTIONPROPERTY('local_net_address') AS local_net_address

Scala/Lift Database Connections

I am working on a new web application using Scala with Lift. I want to make it reusable so others might install it on their own servers for their own needs. I come out of a PHP background where it is common practice to create an install form asking for database connection details. This information is stored in a configuration file and used by the rest of the PHP application for connecting to the database. It is extremely convenient for the user because everything is contained within the directory storing the PHP files. They are free to define everything else. My Java/Scala background has all been enterprise work where an application was only intended to run on the database we set up for it. It was not meant to be installed on others' web servers or with different databases.
So my question is how is this typically done for the Java/Scala world? If there are open source applications implementing the mainstream solution, feel free to point me to those too.
I use this to set up the database:
val vendor =
  new StandardDBVendor(
    Props.get("db.driver") openOr "org.h2.Driver",
    Props.get("db.url") openOr "jdbc:h2:mem:db;AUTO_SERVER=TRUE",
    Props.get("db.user"),
    Props.get("db.password"))

LiftRules.unloadHooks.append(vendor.closeAllConnections_! _)
DB.defineConnectionManager(DefaultConnectionIdentifier, vendor)
The 'Props' referred to will then be (by default) in the file default.props in the props directory in resources.
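For reference, a minimal default.props for the snippet above might look like this (all values are placeholders):
db.driver=org.h2.Driver
db.url=jdbc:h2:./myapp_db;AUTO_SERVER=TRUE
db.user=sa
db.password=changeme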
Updated: This is what I do on servers in production. With 'Props.whereToLook' you provide a function that retrieves an input stream of the configuration. This can be a file, as in the example below, or you could, for example, fetch it over a network socket.
If the configuration cannot be found, you will probably let the application fail with an error.
val localFile = () => {
  val file = new File("/opt/jb/myapp/etc/myapp.props")
  if (file.exists) Full(new FileInputStream(file)) else Empty
}

Props.whereToLook = () => (("local", localFile) :: Nil)
I am not sure if I am missing your points.
By default, Lift uses a Scala source file (Boot.scala) to configure all the settings, because Lift doesn't want to introduce another language into the framework; however, you can override some of the configuration using a .properties file.
In the Java/Scala world, we use .properties files. A .properties file is just a plain text file used for configuration, localization, etc., just like text configuration files in PHP.
The Lift framework has default support for external database configuration files; you can check out the code in Boot.scala. If a .properties file exists, the database will be initialized using that connection configuration; if it doesn't, the source file configuration will be used.

JSP website pre-database configuration

I'm working on a website in JSP (in GWT really, but on the server side, it's really just JSP), and I need to configure my database.
I know HOW to code the database connection etc., but I'm wondering how/where the database config should be saved.
To clarify my doubt, let me give an example; in PHP, a website usually has a config.php where the user configures the database, user, etc. (or an install.php generates it).
However, since JSP is compiled to bytecode, I can't code this info into my site and have the user modify it, nor can I modify it analogously to an install.php.
How should I handle this? What's the best/most common practice? I've found NO examples of this. Mainly, where should the config file be stored?
There are several possibilities to do this; what I've seen done includes:
Having database credentials in a special file, usually db.properties or some simple XML file that contain the required information (driver, url, username, password, any ORM parameters if needed). The properties file would be placed under WEB-INF or WEB-INF/classes; the downside of this approach is that the user would have to modify the file inside the WAR before deploying it to the application server.
Acquire the database connection via JNDI and expect it to be provided by the application server. This seems to be the most common way of doing this; on the upside, your WAR doesn't have to be changed, however, the downside is that configuring a JNDI data source is different for every application server and may be confusing if your system administrators are not experienced with Java technology.
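For illustration, a minimal sketch of the JNDI approach from the second option (the name jdbc/MyAppDS is an assumed example; the real name comes from your application server's data source configuration):
import java.sql.Connection;

import javax.naming.InitialContext;
import javax.sql.DataSource;

public final class JndiLookupExample {

    // Looks up the container-managed data source by its logical JNDI name;
    // the driver, URL, and credentials live in the application server config.
    public static Connection openConnection() throws Exception {
        InitialContext ctx = new InitialContext();
        DataSource ds = (DataSource) ctx.lookup("java:comp/env/jdbc/MyAppDS");
        return ds.getConnection();
    }
}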

Resources