How to provide KafkaSource SSL files to Flink worker nodes - apache-flink

I am creating a Kafka-based Flink streaming application, and am trying to create an associated KafkaSource connector in order to read Kafka data.
For example:
final KafkaSource<String> source = KafkaSource.<String>builder()
        // standard source builder setters
        // ...
        .setProperty(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "truststore.jks")
        .build();
The truststore.jks file is created locally on the job manager node before the application is executed, and I've verified that it exists and is correctly populated. My problem is that, in a distributed Flink application, this truststore.jks does not automatically exist on the task worker nodes as well, so the above code results in a FileNotFoundException when executed.
What I've tried:
Use env.registerCacheFile and getRuntimeContext().getDistributedCache().getFile() in order to distribute the file to all nodes, but since the graph is being built and the application is not yet running, the RuntimeContext is not available at this stage.
Supply a base64 parameter representation of the truststore, and manually convert it to .jks format. I'd need some sort of "pre-initialization" KafkaSource hook to do this, and haven't found any such functionality in the docs.
Use an external data store, such as S3, and retrieve the file from there. As far as I can tell, the internal Kafka consumer does not support non-local filesystems, so I'd still need some pre-initialization way to retrieve the file locally on each task node.
What is the best way to make this file available to task worker nodes during the source initialization?
I have read similar questions posted here before:
how to distribute files to worker nodes in apache flink
As explained above, I don't have access to the RuntimeContext at this point in the application.
Flink Kafka Connector SSL Support
This injects the truststore as a base64 encoded string parameter. I could do this, but since the internal Kafka consumer expects a file, I would have the problem of converting the parameter to .jks format before consumer initialization. I don't see a way of registering a "pre-initialization" hook for the KafkaSource in the docs.

Update:
I was able to work around this issue by using the ssl.truststore.certificates configuration field instead. This lets me supply the trusted certificate(s) directly as a base64-encoded (PEM) string rather than a local file path.
[I also had to update my kafka-clients dependency to 2.7.x or later, as this configuration is not available in older versions of the library.]
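For reference, a minimal sketch of what this looks like on the KafkaSource builder, assuming kafka-clients 2.7+; the certificate content is a placeholder, and ssl.truststore.type has to be set to PEM for ssl.truststore.certificates to be honored:
// Minimal sketch: supply the trusted certificate(s) inline as PEM instead of a truststore file.
// Requires kafka-clients 2.7+; the certificate content below is a placeholder.
final String trustedCertsPem = "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----";

final KafkaSource<String> source = KafkaSource.<String>builder()
        // standard source builder setters
        // ...
        .setProperty(SslConfigs.SSL_TRUSTSTORE_TYPE_CONFIG, "PEM")
        .setProperty(SslConfigs.SSL_TRUSTSTORE_CERTIFICATES_CONFIG, trustedCertsPem)
        .build();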

Related

Using another FileSystem configuration while creating a job

Summary
We are currently facing an issue with the FileSystem abstraction in Flink. We have a job that can dynamically connect to an S3 source (meaning it's defined at runtime).
We discovered a bug in our code, and it could be due to a wrong assumption on the way the FileSystem works.
Bug explanation
During the initialization of the job (so in the job manager), we manipulate the FS to check that some files exist, in order to fail gracefully before the job is executed.
In our case, we need to set the FS dynamically. It can be either HDFS, S3 on AWS, or S3 on MinIO.
We want the FS configuration to be specific to the job, and different from the cluster one (different access key, different endpoint, etc.).
Here is an extract of the code we are using to do so:
private void validateFileSystemAccess(Configuration configuration) throws IOException {
    // Create a plugin manager from the configuration
    PluginManager pluginManager = PluginUtils.createPluginManagerFromRootFolder(configuration);
    // Init the FileSystem from the configuration
    FileSystem.initialize(configuration, pluginManager);
    // Validate the FileSystem: an exception is thrown if FS configuration is wrong
    Path archiverPath = new Path(this.archiverPath);
    archiverPath.getFileSystem().exists(new Path("/"));
}
After starting that specific kind of job, we notice that:
the checkpointing does not work for this job; it throws a credential error.
the job manager cannot upload the artifacts needed by the history server for any already-running job of any kind (not only this specific kind of job).
If we do not deploy that kind of job, the upload of artifacts and the checkpointing work as expected on the cluster.
We think this issue comes from FileSystem.initialize(), which overrides the configuration for all FileSystems. Because of this, the next call to FileSystem.get() returns the FileSystem we configured in validateFileSystemAccess instead of the cluster-configured one.
Questions
Could our hypothesis be correct? If so, how could we provide a specific configuration for the FileSystem without impacting the whole cluster?
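For illustration, this is roughly the kind of job-scoped check we have in mind as an alternative: using Hadoop's FileSystem API directly with a per-job Configuration instead of re-initializing Flink's global registry. It is only a sketch; it assumes hadoop-aws (s3a) is on the classpath, and the bucket name, credentials, and endpoint are placeholders.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: validate S3 access with a job-scoped Hadoop Configuration instead of
// re-initializing Flink's global FileSystem registry.
public class JobScopedS3Validator {

    public void validateS3Access(String accessKey, String secretKey, String endpoint) throws Exception {
        Configuration hadoopConf = new Configuration();
        hadoopConf.set("fs.s3a.access.key", accessKey);
        hadoopConf.set("fs.s3a.secret.key", secretKey);
        hadoopConf.set("fs.s3a.endpoint", endpoint); // e.g. a MinIO endpoint
        // newInstance() bypasses Hadoop's FileSystem cache, so this instance stays local
        // to the job and nothing global is reconfigured.
        try (FileSystem fs = FileSystem.newInstance(URI.create("s3a://my-archive-bucket"), hadoopConf)) {
            fs.exists(new Path("/")); // throws if credentials or endpoint are wrong
        }
    }
}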

Custom deserialiser using the universal Kafka ingest

Following on from this (apologies, had a different user): Kafka Key access on Ingress of a Python Flink Stateful function
Our use case is that we use the Kafka headers for tracing and lineage, as well as for required metadata. Looking at this:
https://github.com/apache/flink-statefun/blob/master/statefun-flink/statefun-flink-io-bundle/src/main/java/org/apache/flink/statefun/flink/io/kafka/binders/ingress/v1/RoutableKafkaIngressDeserializer.java#L45-L61 it looks like the headers are dropped when using the standard deserializer.
Effectively, what I'd want is a way to inject my own deserializer that would return a message containing these headers and any other metadata from the record. I'd want to add something like the UniversalKafkaIngress so that I could configure it using a remote module.
Looking at the code, I can see that I could register a new ExtensionModule, and replace the deserializer (and create a custom kind). Is this recommended? If so - are there any docs on this (if not, how could I configure statefun to pick this up)?
Or, is there another preferred method?
Thanks again...
Ah - found out where I was going wrong.
You can load an ExtensionModule using the standard module SPI process, and therefore register it as a new 'universal' ingress so that it can be loaded remotely. I had a typo, which is why I struggled.
There are a few gotchas - and I'll post a gist a little later to show how it can be done.
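In the meantime, the bare-bones shape of the SPI wiring is sketched below. The fully qualified interface name in the file path is a placeholder to look up in your StateFun release, and com.example.statefun.HeaderAwareKafkaModule is a hypothetical implementation class:
# src/main/resources/META-INF/services/<fully.qualified.name.of.ExtensionModule>
com.example.statefun.HeaderAwareKafkaModule
The module class itself then registers the custom ingress kind whose deserializer copies the record headers into the produced message, which is what the default RoutableKafkaIngressDeserializer drops.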

Flush Agent doesn't clear proxy clientlib paths

I am using AEM 6.3 and allowProxy for clientlibs. As expected, the dispatcher caches the clientlibs under the path /cache/etc.clientlibs/myapp/clientlibs/clientlib.css, but the corresponding JCR path is /apps/myapp/clientlibs/clientlib/mystyle.css.
So when clientlibs are modified during deployment and published, the corresponding Apache cache is not cleared automatically. Today we do this manually.
We also use the automated cache buster VersionedClientlibs, so we never end up loading an obsolete clientlib. But the Apache cache piles up with thousands of obsolete clientlib files if it is not cleared manually.
What is the recommended approach to clearing obsolete, versioned, proxy-allowed clientlibs from the Apache cache?
This is a known limitation, and we've also been flushing the whole /etc.clientlibs path after each deployment. We do this via the ACS Commons dispatcher-flush-ui.
Typically, when deploying to production, you'd flush all or part of the dispatcher cache anyway to make sure component changes are reflected. So adding this task to that process is easy.
If you really want this to become an automatic process, you can:
Write a ResourceChangeListener (example here) or a JCR EventListener (example here), listen for changes at the clientlib path, and replicate the corresponding /etc.clientlibs/ path (a rough sketch follows after this list).
Write a ReplicationPathTransformer so that when your clientlib path is replicated, it is transformed to the corresponding /etc.clientlibs/ path to be flushed in the dispatcher.
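For the first option, here is a rough sketch of what such a listener could look like. This is not a drop-in implementation: the service user name (clientlib-flush-service), the listened path, and the choice of DELETE replication to trigger the flush are assumptions to adapt to your setup.
import java.util.Collections;
import java.util.List;
import java.util.Map;
import javax.jcr.Session;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.resource.ResourceResolverFactory;
import org.apache.sling.api.resource.observation.ResourceChange;
import org.apache.sling.api.resource.observation.ResourceChangeListener;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Reference;
import com.day.cq.replication.ReplicationActionType;
import com.day.cq.replication.Replicator;

// Sketch: listen for changes under the clientlib source path and fire a replication event
// for the corresponding /etc.clientlibs proxy path so the dispatcher flush agent invalidates it.
@Component(
        service = ResourceChangeListener.class,
        property = {
                ResourceChangeListener.PATHS + "=/apps/myapp/clientlibs",
                ResourceChangeListener.CHANGES + "=ADDED",
                ResourceChangeListener.CHANGES + "=CHANGED",
                ResourceChangeListener.CHANGES + "=REMOVED"
        })
public class ClientlibProxyFlushListener implements ResourceChangeListener {

    @Reference
    private Replicator replicator;

    @Reference
    private ResourceResolverFactory resolverFactory;

    @Override
    public void onChange(List<ResourceChange> changes) {
        Map<String, Object> authInfo =
                Collections.singletonMap(ResourceResolverFactory.SUBSERVICE, "clientlib-flush-service");
        try (ResourceResolver resolver = resolverFactory.getServiceResourceResolver(authInfo)) {
            Session session = resolver.adaptTo(Session.class);
            for (ResourceChange change : changes) {
                // Map /apps/myapp/clientlibs/... to the proxied /etc.clientlibs/myapp/clientlibs/... path.
                String proxyPath = change.getPath().replaceFirst("^/apps/", "/etc.clientlibs/");
                // DELETE replication is used here only to push an invalidation to the flush agent;
                // adjust the action type if your flush rules expect something else.
                replicator.replicate(session, ReplicationActionType.DELETE, proxyPath);
            }
        } catch (Exception e) {
            // Flushing is best-effort here; log and move on in a real implementation.
        }
    }
}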
Hope this helps.

Processing a large (>32mb) xml file over appengine

I'm trying to process large (~50 MB) XML files and store them in the datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even straight-up uploading the file within my source code, but I keep running into limits (i.e. the 32 MB limit).
So, I'm really confused (and a little angry/frustrated). Does App Engine really have no real way to process a large file? There does seem to be one potential workaround, which would involve remote_api, Amazon (or Google Compute, I guess), and a security/setup nightmare...
HTTP ranges were another thing I considered, but it will be painful to somehow connect the different split parts together (unless I can manage to split the file at exact points).
This seems crazy, so I thought I'd ask Stack Overflow... am I missing something?
update
Tried using range requests, and it looks like the server I'm trying to stream from doesn't support them. So right now I'm thinking of downloading the file, hosting it on another server, then using App Engine to access it via HTTP range requests on backends, AND then automating the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)
What about storing it in Cloud Storage and reading it incrementally? You can access it line by line (in Python, anyway), so it won't consume all your resources.
https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
https://developers.google.com/storage/
The GCS client library lets your application read files from and write files to buckets in Google Cloud Storage (GCS). This library supports reading and writing large amounts of data to GCS, with internal error handling and retries, so you don't have to write your own code to do this. Moreover, it provides read buffering with prefetch so your app can be more efficient.
The GCS client library provides the following functionality:
An open method that returns a file-like buffer on which you can invoke standard Python file operations for reading and writing.
A listbucket method for listing the contents of a GCS bucket.
A stat method for obtaining metadata about a specific file.
A delete method for deleting files from GCS.
I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.
import os
import cloudstorage as gcs  # GoogleAppEngineCloudStorageClient library

def read_file(self, filename):
    self.response.write('Truncated file content:\n')
    gcs_file = gcs.open(filename)
    self.response.write(gcs_file.readline())  # read just the first line
    gcs_file.seek(-1024, os.SEEK_END)         # jump to 1024 bytes before the end
    self.response.write(gcs_file.read())      # read the remaining bytes
    gcs_file.close()
Incremental reading with standard python!

Parse a log4j log file

We have several applications that use log4j for logging. I need to get a log4j parser working so we can combine multiple log files and run automated analysis on them. I'm not looking to reinvent the wheel, so can someone point me to a decent pre-existing parser? I do have the log4j conversion pattern if that helps.
If not, I'll have to roll my own.
I didn't realize that Log4j ships with an XML appender.
The solution was to specify an XML appender in the logging configuration file, include that output XML file as an entity in a well-formed XML file, and then parse the XML using your favorite technique.
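For reference, the relevant appender definition looks roughly like this in a log4j 1.x XML configuration (appender name and file path are placeholders):
<!-- log4j.xml: write events with XMLLayout so the output can be included as an XML entity -->
<appender name="XML_FILE" class="org.apache.log4j.FileAppender">
    <param name="File" value="logs/app-log.xml"/>
    <param name="Append" value="true"/>
    <layout class="org.apache.log4j.xml.XMLLayout"/>
</appender>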
The other methods had the following limitations:
Apache Chainsaw - not automated enough
JDBC - poor performance in a high-performance distributed app
You can use OtrosLogViewer with batch processing. You have to:
Define your log format; you can use the Log4j pattern layout parser or Log4j XmlLayout.
Create a Java class that implements LogDataParsedListener. The method public void logDataParsed(LogData data, BatchProcessingContext context) will be called for every parsed log event (see the sketch after these steps).
Create a jar.
Run OtrosLogViewer, specifying your log-processing jar, your LogDataParsedListener implementation, and the log files.
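A minimal sketch of the listener from step 2, assuming the interface lives under the pl.otros.logview.* packages (verify the exact imports against the OtrosLogViewer release you use, and implement any additional lifecycle methods your version declares):
import pl.otros.logview.api.batch.BatchProcessingContext;
import pl.otros.logview.api.batch.LogDataParsedListener;
import pl.otros.logview.api.model.LogData;

public class CountingListener implements LogDataParsedListener {

    private long eventCount = 0;

    @Override
    public void logDataParsed(LogData data, BatchProcessingContext context) {
        // Called once for every parsed log event; plug your automated analysis in here,
        // e.g. inspect the event's level, timestamp and message via LogData's accessors.
        eventCount++;
    }
}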
What you are looking for is called SawMill, or something like it.
Log4j log files aren't really suitable for parsing; they're too complex and unstructured. There are third-party tools that can do it, I believe (e.g. Sawmill).
If you need to perform automated, custom analysis of the logs, you should consider logging to a database and analysing that. Log4j ships with a JDBCAppender, which appends all messages to a database of your choice, but it has performance implications and it's a bit flaky. There are other, similar alternatives on the interweb, though (like this one).
You -can- use Log4j's Chainsaw V2 to process the various log files and collect them into one table, and either output those events as xml or use Chainsaw's built-in expression-based filtering, searching & colorizing support to slice & dice the logs.
Steps:
- Start Chainsaw V2
- Create a chainsaw configuration file by copying the example configuration file available from the Welcome tab - define one LogFilePatternReceiver 'plugin' entry for each log file that you want to process
- Start Chainsaw with that configuration
- Each log file will end up as a separate tab in the UI
- Pause the chainsaw-log tab and clear the events from that tab
- Create a new tab which aggregates the events from the various tabs by going to the 'View, Create custom expression LogPanel' menu item and entering 'level >= DEBUG' in the box. It will create a new tab containing events from all of the tabs with level >= DEBUG (which is why you cleared the chainsaw-log tab).
You can get an overview of the expression syntax used to filter, colorize and search from the tutorial (available from the Help menu).
If you don't want to use Chainsaw, you can do something similar - start a simple app that doesn't log but loads a log4j.xml config file with the 'plugin' entries you defined for the Chainsaw configuration, but also define a FileAppender with an XMLLayout - all of the events received by the 'receivers' will be sent to the single appender.
