Creating new files after a specific period of time with Apache Camel File

I have this Camel route that consumes from Kafka in a constant flow. Is there some way to define that every 10 minutes, for example,
a new file is created? Each file needs a different name, since one file cannot have the same name as another. I've tried the timer and quartz Camel components without success.
from("kafka:testTopic?brokers=localhost:9092&groupId=01")
.to("file:C://DESTINATION?fileExist=Append&fileName=prefixName_${variable}.json");

Related

How to provide KafkaSource SSL files to Flink worker nodes

I am creating a Kafka-based Flink streaming application, and am trying to create an associated KafkaSource connector in order to read Kafka data.
For example:
final KafkaSource<String> source = KafkaSource.<String>builder()
// standard source builder setters
// ...
.setProperty(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "truststore.jks")
.build();
The truststore.jks file is created locally on the job manager node before the application is executed, and I've verified that it exists and is correctly populated. My problem is that, in a distributed Flink application, this truststore.jks does not automatically also exist on the task worker nodes, so the above code results in a FileNotFoundException when executed.
What I've tried:
Use env.registerCacheFile and getRuntimeContext().getDistributedCache().getFile() in order to distribute the file to all nodes, but since the graph is being built and the application is not yet running, the RuntimeContext is not available at this stage.
Supply a base64 parameter representation of the truststore, and manually convert it to .jks format. I'd need some sort of "pre-initialization" KafkaSource hook to do this, and haven't found any such functionality in the docs.
Use an external data store, such as s3, and retrieve the file from there. As far as I can tell, the internal Kafka consumer does not support non-local filesystems, so I'd still need some pre-initialization way to retrieve the file locally on each task node.
What is the best way to make this file available to task worker nodes during the source initialization?
I have read similar questions posted here before:
how to distribute files to worker nodes in apache flink
As explained above, I don't have access to the RuntimeContext at this point in the application.
Flink Kafka Connector SSL Support
This injects the truststore as a base64 encoded string parameter. I could do this, but since the internal Kafka consumer expects a file, I would have the problem of converting the parameter to .jks format before consumer initialization. I don't see a way of registering a "pre-initialization" hook for the KafkaSource in the docs.
Update:
I was able to work around this issue by instead using the ssl.truststore.certificates configuration field. This allows me to supply a base64-encoded representation of the underlying truststore.jks certificate instead of a local file path.
[I also had to update my kafka-clients dependency to 2.7.x+ as this configuration is not available in older versions of the library]
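For illustration, a minimal sketch of that workaround, assuming kafka-clients 2.7+ and a String variable pemCertificates holding the PEM-encoded certificate chain (e.g. decoded from a job parameter; the variable name is hypothetical):

final KafkaSource<String> source = KafkaSource.<String>builder()
    // standard source builder setters
    // ...
    // supply the trust material inline instead of as a local file path
    .setProperty(SslConfigs.SSL_TRUSTSTORE_TYPE_CONFIG, "PEM")
    .setProperty(SslConfigs.SSL_TRUSTSTORE_CERTIFICATES_CONFIG, pemCertificates)
    .build();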

Consume multiple text files with Apache Flink DataSet API

I am writing a batch job with Apache Flink using the DataSet API. I can read a text file using readTextFile(), but this function only reads one file at a time.
I would like to consume all the text files in my directory one by one and process them, in the same function, as a single batch job with the DataSet API, if that is possible.
Another option is to implement a loop that runs multiple jobs, one for each file, instead of one job with multiple files. But I don't think that solution is the best.
Any suggestion?
If I read the documentation correctly, you can read an entire path using ExecutionEnvironment.readTextFile(). You can find an example here: Word-Count-Batch-Example
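For illustration, a minimal sketch, assuming the input directory path is passed to readTextFile() (a directory path makes Flink read every file inside it):

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// passing a directory instead of a single file consumes all files in it
DataSet<String> lines = env.readTextFile("file:///path/to/input-directory");
lines.print();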
References:
Flink Documentation
Flink Sources

Camel SFTP fetch on schedule and on demand

I can see similar problems in different variations but haven't managed to find a definite answer.
Here is the usecase:
SFTP server that I want to poll from every hour
on top of that, I want to expose a REST endpoint that the user can hit to force an ad-hoc retrieval from that same SFTP. I'm happy for the polling schedule to remain as-is, i.e. if I polled and the user forces a refresh 20 mins later, the next poll can be 40 mins after that.
Both of these should be idempotent, in that a file downloaded via the polling mechanism should not be downloaded again in an ad-hoc pull, and vice versa. Both ways of accessing should download ALL the available files that were not yet downloaded (there will likely be more than one new file - I saw a similar question here for on-demand fetch, but it was for a single file).
I would like to avoid hammering the SFTP via pollEnrich - my understanding is that each pollEnrich would request a fresh list of files from SFTP, so doing pollEnrich in a loop until all files are retrieved would be calling the SFTP multiple times.
I was thinking of creating a route that will start/stop a separate route for the ad-hoc fetch, but I'm not sure that this would allow for the idempotent behaviour between routes to be maintained.
So, smart Camel brains out there, what is the most elegant way of fulfilling such requirements?
Not a smart Camel brain, but I'll give it a try as per my understanding.
Hope, you already went through:
http://camel.apache.org/file2.html
http://camel.apache.org/ftp2.html
I would have created a filter and separate routes for the consumer and producer.
For consuming, I would have used these file options: idempotent=true, delay, initialDelay, useFixedDelay=true, maxMessagesPerPoll=1, eagerMaxMessagesPerPoll=true, readLock=idempotent, idempotentKey=${file:onlyname}, idempotentRepository, recursive=false. See the sketch below.
No files will be read again! You can use a variety of options as documented and try which suits you best, like the delay option.
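A minimal sketch of such a consumer route, assuming placeholder host/credentials, a local target directory, and a shared IdempotentRepository bean registered as fileRepo (sharing the repository between the scheduled and ad-hoc routes is what keeps the two idempotent with respect to each other); note readLock=idempotent is documented for the file component, so readLock=changed is used here for SFTP:

from("sftp://user@host//inbound?password=secret"                  // placeholder credentials
        + "&delay=3600000&initialDelay=1000&useFixedDelay=true"   // poll hourly
        + "&maxMessagesPerPoll=1&eagerMaxMessagesPerPoll=true"
        + "&readLock=changed"
        + "&idempotent=true&idempotentKey=${file:onlyname}"
        + "&idempotentRepository=#fileRepo"                       // shared repo across routes
        + "&recursive=false")
    .to("file:/local/downloads");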
"I would like to avoid hammering the SFTP via pollEnrich - my understanding is that each pollEnrich would request a fresh list of files from SFTP, so doing pollEnrich in a loop until all files are retrieved would be calling the SFTP multiple times." - > Unless you use the option disconnect=true, the connection will not be terminated and you can either consume or produce files continously, check ftp options for disconnect and disconnectOnBatchComplete.
Hope this helps!

Reading file content using file name from db in camel

I have to run a query against the DB to get the filenames from a table. Then I have to read the contents of the files in a folder/directory using the filenames I got from the query. I have done the query part and stored the list of filenames in the Exchange using a bean, but I am wondering how I can use these filenames from the Exchange to read the file contents. Could you please help?
You can use the Content Enricher pattern (http://camel.apache.org/content-enricher.html) and, from Camel 2.16, dynamic endpoints to load the contents of a file by a path previously obtained from the database.
UPDATED
You have to use pollEnrich (because the file component is a polling consumer) to consume files from a URI, and you can use an expression (such as Simple) to configure the fileName.
You can try something like this (only for Camel versions 2.16 and later; the directory is a placeholder):
.pollEnrich("file:yourDirectory?fileName=${header.FILE_NAME}", 1000, new YourAggregationStrategy())

I want to process the same file multiple times in Apache Camel 2

My requirement is to process the same file multiple times in Apache Camel 2. How can I achieve this?
If I use noop=true, then idempotent will be set to true, so the file is not picked up again; otherwise the file will be moved to another directory.
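One possible approach, sketched with a hypothetical input directory: noop=true leaves the file in place, and explicitly adding idempotent=false overrides the default so the same file is picked up again on every poll:

// noop=true would normally force idempotent=true; overriding it with
// idempotent=false makes the consumer re-read the same file on each poll
from("file:/data/input?noop=true&idempotent=false&delay=5000")
    .log("Processing ${file:name} again")
    .to("direct:process");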
