Idempotency in a camel application running in Kubernetes cluster - file

I am using Apache Camel as the integration framework in my microservice, which I deploy to a Kubernetes cluster as multiple pods. I have written a route that reads files from a directory and writes them to another directory, but I am facing an issue: the different pods pick up the same file. I want only one pod to pick up and process each file, yet currently all the pods pick it up and process it. Can someone help with this? Please suggest some examples available on GitHub or elsewhere.
Thanks in advance.

Camel recently introduced some interesting clustering capabilities - see here.
In your particular case, you could model a route that takes the cluster leadership before starting the directory polling, thereby preventing the other nodes from picking up the (same or other) files.
Setting it up is very easy: all you need is to prefix the singleton endpoint according to the master component syntax:
master:namespace:delegateUri
This would result in something like this:
from("master:mycluster:file://...")
.routeId("clustered-route")
.log("Clustered file polling !");

Related

How to follow an updating local file while using Flink

As mentioned in the document:
For example a data pipeline might monitor a file system directory for new files and write their data into an event log. Another application might materialize an event stream to a database or incrementally build and refine a search index.
So, how can I follow a local file system file as it is updated while using Flink?
Here, the document also mentioned that:
File system sources for streaming is still under development. In the future, the community will add support for common streaming use cases, i.e., partition and directory monitoring.
Does this mean I can use the API for this kind of special streaming? If you know how to use a streaming file system source, please tell me. Thanks!
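As a rough starting point (not an official answer), the DataStream API already has a file source that can re-scan a path continuously via readFile with PROCESS_CONTINUOUSLY. A minimal sketch, assuming the Flink streaming Java API is on the classpath; the path and poll interval are placeholders, and note that in this mode a modified file is re-read in full rather than tailed:
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class FollowLocalFile {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        String path = "file:///tmp/watched-dir";  // hypothetical path to monitor
        TextInputFormat format = new TextInputFormat(new Path(path));

        // Re-scan the path every second and emit the lines of new or modified files.
        DataStream<String> lines = env.readFile(
                format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 1000L);

        lines.print();
        env.execute("follow-local-file");
    }
}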

Using Flink LocalEnvironment for Production

I wanted to understand the limitations of LocalExecutionEnvironment and whether it can be used in production.
Appreciate any help/insight. Thanks
LocalExecutionEnvironment spins up a Flink MiniCluster, which runs the entire Flink system (JobManager, TaskManager) in a single JVM. So you're limited to the CPU cores and memory available on that one machine. You also don't have HA from multiple JobManagers. I haven't looked at other limitations of the MiniCluster environment, but I'm sure more exist.
A LocalExecutionEnvironment doesn't load a config file on startup, so you have to do all of the configuration in the application. By default it also doesn't offer a REST endpoint. You can solve both these issues by doing something like this:
// Load flink-conf.yaml from the current working directory and enable the embedded web UI / REST endpoint.
String cwd = Paths.get(".").toAbsolutePath().normalize().toString();
Configuration conf = GlobalConfiguration.loadConfiguration(cwd);
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);
Logging may be another issue that will require a workaround.
I don't believe you'll be able to use the Flink CLI to control the job, but if you create the Web UI (as shown above) you can at least use the REST API to do things like triggering savepoints (after first using the REST API to get the job ID).
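As a hedged illustration of that last point (not part of the original answer): the standard Flink REST endpoints GET /jobs and POST /jobs/<jobid>/savepoints can be driven from plain Java. The host/port, savepoint directory, and job ID below are assumptions, and the exact payloads can differ between Flink versions.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TriggerSavepoint {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "http://localhost:8081";  // assumed REST / Web UI address

        // 1) List jobs to find the job ID (the response is JSON; parse it with your library of choice).
        HttpResponse<String> jobs = client.send(
                HttpRequest.newBuilder(URI.create(base + "/jobs")).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println("Jobs: " + jobs.body());

        String jobId = "<job-id-from-the-response>";  // placeholder

        // 2) Trigger a savepoint for that job.
        String body = "{\"target-directory\":\"file:///tmp/savepoints\",\"cancel-job\":false}";
        HttpResponse<String> savepoint = client.send(
                HttpRequest.newBuilder(URI.create(base + "/jobs/" + jobId + "/savepoints"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(body))
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println("Savepoint trigger: " + savepoint.body());
    }
}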

Kubernetes: How to manage data with multiple replicas?

I am currently learning Kubernetes and I'm stuck on how to handle the following situation:
I have a Spring Boot application which handles files (photos, PDFs, etc.) uploaded by users; users can also download these files. The application also produces logs, which are spread across 6 different files. To make my life easier I decided to have a root directory containing 2 subdirectories (one for user data and one for logs), so the application works with only one directory (appData):
.appData
|__ usersData
|__ logsFile
I would like to use GKE (Google Kubernetes Engine) to deploy this application but I have these problems:
How do I handle multiple replicas which will concurrently read/write data and logs in the appData directory?
Regarding logs, is it possible to have multiple Pods writing to the same file?
Say we have 3 replicas (Pod-A, Pod-B and Pod-C): if user A uploads a file that is handled by Pod-B, how will Pod-A and Pod-C discover this file if the same user later requests it?
Should each replica have its own volume? (I would like to avoid this situation, which seems the case when using StatefulSet)
Should I have only one replica? (using Kubernetes will be useless in that case)
The same questions apply to database replicas.
I use PostgreSQL and I have the same questions: if we have multiple replicas and requests are randomly sent to them, how can I be sure that a request for data will return a result?
I know these are a lot of questions. Thanks a lot for your clarifications.
I'd do two separate solutions for logs and for shared files.
For logs, look at a log aggregator like fluentd.
For the shared file system, you want NFS. Take a look at this example: https://github.com/kubernetes/examples/tree/master/staging/volumes/nfs. The NFS server will use a persistent volume from GKE, Azure, or AWS. It's not cloud-agnostic per se, but the only thing you change to work in a different cloud is the provisioner.
You can use a persistent volume with NFS in GKE (Google Kubernetes Engine) to share files across pods:
https://cloud.google.com/filestore/docs/accessing-fileshares
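To make the shared-volume point concrete: once every replica mounts the same ReadWriteMany NFS volume, a file written by one pod is readable by the others at the same path. A minimal sketch, with /mnt/appData as a hypothetical mount point (the real path depends on how you declare the volumeMount):
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SharedVolumeDemo {
    // Hypothetical mount point; every replica mounts the same ReadWriteMany PVC here.
    private static final Path USERS_DATA = Paths.get("/mnt/appData/usersData");

    public static void main(String[] args) throws Exception {
        Files.createDirectories(USERS_DATA);

        // Pod-B writes the uploaded file...
        Path uploaded = USERS_DATA.resolve("user-a-photo.jpg");
        Files.write(uploaded, new byte[]{ /* uploaded bytes */ });

        // ...and Pod-A or Pod-C can serve the same path later, because all
        // replicas see the same underlying NFS share.
        byte[] bytes = Files.readAllBytes(uploaded);
        System.out.println("Read " + bytes.length + " bytes from the shared volume");
    }
}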

Need some information about Crawl Anywhere and Solr

I have gone through the Crawl Anywhere documentation but I am very confused about its installation steps.
What I understood is that Apache is optional. But do I need an independent Tomcat instance for the crawler? From what I saw in the folder structure, a tomcat folder is already present and a WAR file is there as well.
Also, do we need an independent instance of Apache Solr?
If we want to add a PostgreSQL database to the crawler, how can we do that?
Please also provide some links so that I can go through them and clear up any doubts I have.
Apache is needed to use the admin interface. Tomcat is needed for some interactive features. You can crawl without either of them.
No.
MySQL and MongoDB are supported. The code is open source, so you can add PostgreSQL support.
Try Google Groups for other questions.

Configuring logging for ActiveMQ 5.5 on Tomcat 6 with web app using SLF4j and logback

I would like my web app to log using SLF4J and Logback. However, I am using ActiveMQ, which requires that some of its jars go in /usr/share/tomcat6/lib (this is because the queues are defined outside of the web app, so the classes to support them must be at container level).
ActiveMQ 5.5+ requires slf4j-api, so that jar has to go in too. Because SLF4J now starts at container level, it needs a logging backend added or it will simply no-op. Thus, logback-core and logback-classic go in as well.
After quite some frustration I got this working well enough that I can tidy it up shortly. I needed to configure Logback to use a JNDI lookup to get the context. Then it can look up logback-kenobi.xml in my web app and have a separate configuration there.
However, I'm wondering if this is the best way to do it. For one, the context handling appears not to support the Groovy format. I did have a logback.groovy in my web app that logged to the console when I was developing locally (which means Eclipse WTP works nicely) but logged to file and to Splunk Storm everywhere else. I want to do something similar with this setup, but I'm not sure whether I should do that by overwriting logback-kenobi.xml or by some other method.
Note that I don't currently need Tomcat itself to log with SLF4J, although I am planning to do that. Nor do I really need ActiveMQ to log with SLF4J, but I did need it to stop spewing debug messages every 30 seconds as it was doing. I am aware of tomcat-slf4j-logback, but I don't believe it is directly useful, as it is ActiveMQ's logging that is the issue.
However, I'm wondering if this is the best way to do this.
Best is an opinion, working is a fact.
