How to follow an updating local file while using Flink - apache-flink

As mentioned in the documentation:
For example a data pipeline might monitor a file system directory for new files and write their data into an event log. Another application might materialize an event stream to a database or incrementally build and refine a search index.
So, how can I follow updates to a local file system file while using Flink?
The documentation also mentions:
File system sources for streaming is still under development. In the future, the community will add support for common streaming use cases, i.e., partition and directory monitoring.
Does this mean I could use the API for this kind of streaming? If you know how to use a streaming file system source, please tell me. Thanks!
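For what it's worth, later Flink releases do expose a continuous file-monitoring mode in the DataStream API. Below is a minimal sketch, assuming a Flink version where readFile with FileProcessingMode.PROCESS_CONTINUOUSLY is available; the path is a placeholder, and note that a modified file is re-read in its entirety, not just the appended portion:

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class FollowLocalFile {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        String path = "file:///tmp/watched";  // placeholder: file or directory to monitor
        TextInputFormat format = new TextInputFormat(new Path(path));

        // PROCESS_CONTINUOUSLY re-scans the path every 1000 ms; a modified
        // file is re-processed entirely, which breaks exactly-once per line.
        DataStream<String> lines = env.readFile(
                format, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 1000L);

        lines.print();
        env.execute("follow-local-file");
    }
}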

Related

Idempotency in a camel application running in Kubernetes cluster

I am using Apache Camel as the integration framework in my microservice. I am deploying it to a Kubernetes cluster as multiple pods. I wrote a route that reads files from a directory and writes them to another, but I am facing an issue: different pods are picking up the same file. I need to avoid that. I only want one of the pods to pick up and process a file, but currently all the pods are picking up and processing the same files. Can someone help with this? Please suggest some examples available on GitHub or elsewhere.
Thanks in advance.
Camel recently introduced some interesting clustering capabilities - see here.
In your particular case, you could model a route which takes the leadership before starting the directory polling, thereby preventing other nodes from picking up the (same or other) files.
Setting it up is very easy: all you need is to prefix singleton endpoints according to the master component syntax:
master:namespace:delegateUri
This would result in something like this:
from("master:mycluster:file://...")
.routeId("clustered-route")
.log("Clustered file polling !");

Dynamic Job Creation and Submission to Flink

Hi, I am planning to use Flink as a backend for a feature where we will show a UI to users to graphically create event patterns, e.g.: multiple login failures from the same IP address.
We will create the Flink pattern programmatically from the criteria given by the user in the UI.
Is there any documentation on how to dynamically create the jar file and dynamically submit the job with it to flink cluster?
Is there any best practice for this kind of use case using Apache Flink?
The other way you can achieve this is to have one jar which contains something like an "interpreter": you pass it the definition of your patterns in some format (e.g. JSON), and the "interpreter" translates that JSON into Flink operators. It is done this way in the Flink-based execution engine of https://github.com/TouK/nussknacker/. If you use such an approach, you will need to handle redeployment of new definitions in your own application.
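To illustrate the interpreter idea (this is not Nussknacker's actual code, just a sketch under that assumption): one generic jar receives the pattern definition at submission time and translates it into Flink operators. The event type and threshold field below are made up:

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PatternInterpreterJob {
    public static void main(String[] args) throws Exception {
        // Stand-in for a JSON pattern definition, e.g. {"minFailures": 5}
        final int minFailures = Integer.parseInt(args.length > 0 ? args[0] : "5");

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source; in practice this would be Kafka or similar.
        DataStream<LoginEvent> events = env.fromElements(
                new LoginEvent("10.0.0.1", 7),
                new LoginEvent("10.0.0.2", 1));

        // The "interpreter" step: translate the definition into an operator.
        FilterFunction<LoginEvent> predicate = e -> e.failureCount >= minFailures;

        events.filter(predicate).print();
        env.execute("interpreted-pattern");
    }

    public static class LoginEvent {
        public String ip;
        public int failureCount;
        public LoginEvent() {}
        public LoginEvent(String ip, int failureCount) {
            this.ip = ip;
            this.failureCount = failureCount;
        }
    }
}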
One straightforward way to achieve this would be to generate a SQL script for each pattern (using MATCH_RECOGNIZE) and then use Ververica Platform's REST API to deploy and manage those scripts: https://docs.ververica.com/user_guide/application_operations/deployments/artifacts.html?highlight=sql#sql-script-artifacts
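As a rough sketch of what such a generated script could look like when submitted through the Table API (the table, columns, and the exact pattern are assumptions, and the DDL for login_events is omitted):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class GeneratedSqlPattern {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Assumes login_events (ip, ts, success) was registered beforehand,
        // with ts declared as a watermarked time attribute.
        tEnv.executeSql(
                "SELECT * FROM login_events MATCH_RECOGNIZE ("
              + "  PARTITION BY ip ORDER BY ts"
              + "  MEASURES LAST(F.ts) AS last_failure"
              + "  AFTER MATCH SKIP PAST LAST ROW"
              + "  PATTERN (F{3})"
              + "  DEFINE F AS F.success = FALSE"
              + ") AS T").print();
    }
}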
Flink doesn't provide tooling for automating the creation of JAR files, or for submitting them. That's the sort of thing you might use a CI/CD pipeline to do (e.g., GitHub Actions).
Disclaimer: I work for Ververica.

How to automate an upload process with Talend whenever a file is moved into a specific folder

I have zero experience with ETL.
Whenever a file (a .csv) is moved into a specific folder, it should be uploaded to Salesforce. I don't know how to build this automated flow.
I hope I was clear enough.
I have to use the open-source version; any helpful links or resources will be appreciated.
Thank you in advance
You could definitely use Talend Open Studio for ESB: this studio contains 'Routes' functionality. You'll be able to use a cFile component, which will check your folder for new files and raise an event that propagates through the route to a designated endpoint (for example a Salesforce API). Talend ESB maps Apache Camel components, which are well documented.
Check out Routes with Talend for ESB; it should do the trick.
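Since Talend ESB routes map onto Camel, the equivalent in plain Camel Java DSL would look roughly like the sketch below; the folder, the file filter, and the camel-salesforce endpoint options are assumptions, and in the Studio you would wire a cFile component and a Salesforce endpoint graphically instead:

import org.apache.camel.builder.RouteBuilder;

public class CsvToSalesforceRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("file:/data/inbox?include=.*\\.csv&move=.done")
            .log("Uploading ${header.CamelFileName}")
            // Assumes a camel-salesforce component configured with credentials;
            // converting the CSV body into an SObject is omitted here.
            .to("salesforce:createSObject?sObjectName=Lead");
    }
}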
We have the tFileExists component; you can use that and configure it to check for the file.
You also have the tFileWait component, where you can define the time frame for the arrival of the files and the number of iterations in which it has to check for the file.
But I would suggest, if you have any scheduling tool, using a file-watcher concept there and then using a Talend job to upload the file to a specific location.
Using Talend itself to check for the file's arrival is not feasible, as the job has to stay in a running state continuously, which consumes more Java resources.

Version Control Systems, lock files on read

I'm in a company which demands simultaneous collaboration on a web application. To avoid any conflicts during development, our CEO suggests using the SVN version control system so each user can lock a file while they're working on it.
SVN does support locking files per user, but the problem is that we need to lock files on read: as far as I know, SVN makes you check for updates every time you want to see the locking status of files. What we need is to prevent other users from even opening/reading the files that have been locked! Does SVN have such a feature? Or is there any other technology, like Google Drive, that achieves such a mechanism?
Thank you all.

Is there a simple way to use the filesystem using the JCR API?

I have an OSGi-based, server side application that uses the filesystem to store scripts and configuration data.
In time, I'd like to move that application to 'the cloud', and that's not going to work well with its current dependency on file system access.
What I'd like to do is insert a JCR layer into this application, so it will still work in the current situation (regular files on the local filesystem), but will pave the way to a cloud situation.
I did find a file connector in ModeShape, but I ran into a pretty severe incompatibility with OSGi, which hasn't been fixed. Besides, ModeShape pulls in LOTS of dependencies (about 6 MB, I think), which is a problem for me.
So I don't see any options besides starting to hack my own JCR implementation, which I am reluctant to do.
Any ideas?
Although you wouldn't be using JCR directly, using the Apache Sling ResourceProvider mechanism should allow you to move easily from filesystem to something else later, and it's OSGi-friendly as Sling is 100% based on OSGi.
You could start now by using Sling's Filesystem resource provider ( http://sling.apache.org/site/accessing-filesystem-resources-extensionsfsresource.html ) and later move to your own custom ResourceProvider, as needed.
The source code of the filesystem provider is at https://svn.apache.org/repos/asf/sling/trunk/bundles/extensions/fsresource - it's quite simple code that can be used as an example for creating your own ResourceProvider.
For your custom system, the question would be how many Sling bundles you need to get that working. I don't know off the top of my head, but I would suggest using the Sling Launchpad to find out: it launches a vanilla Sling system with lots of bundles that you won't need, and you could try reducing it to the minimum that still allows the ResourceProvider mechanism to work.
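For a rough idea of the shape of a custom provider, here is a sketch against the Sling ResourceProvider SPI; the resource type and lookup logic are placeholders, and the OSGi registration properties (such as the provider root) are omitted:

import java.util.Iterator;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.SyntheticResource;
import org.apache.sling.spi.resource.provider.ResolveContext;
import org.apache.sling.spi.resource.provider.ResourceContext;
import org.apache.sling.spi.resource.provider.ResourceProvider;

public class ScriptResourceProvider extends ResourceProvider<Object> {
    @Override
    public Resource getResource(ResolveContext<Object> ctx, String path,
                                ResourceContext resourceContext, Resource parent) {
        // Look up the backing store (filesystem today, cloud later) here;
        // SyntheticResource stands in for a real resource implementation.
        return new SyntheticResource(ctx.getResourceResolver(), path, "my/script");
    }

    @Override
    public Iterator<Resource> listChildren(ResolveContext<Object> ctx, Resource parent) {
        return null; // no children in this sketch
    }
}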
You can also use Apache Commons VFS2; there is, for example, a JCR connector, and you can also use WebDAV or a JDBC table. I use this in a commercial project on top of an atomic (git-like) tree on top of a shared JDBC table.
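A minimal sketch of the VFS2 approach, assuming the relevant provider jars are on the classpath; the URIs are placeholders. The point is that only the URI scheme changes when you move off the local filesystem:

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.VFS;

public class VfsSketch {
    public static void main(String[] args) throws Exception {
        FileSystemManager fsManager = VFS.getManager();

        // Local filesystem today...
        FileObject script = fsManager.resolveFile("file:///opt/app/scripts/init.js");
        // ...later, e.g.: fsManager.resolveFile("webdav://host/repo/scripts/init.js");

        System.out.println(script.exists() + " " + script.getName().getBaseName());
    }
}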
