Camel for multiple files processing

Camel for multiple files processing - apache-camel

I am a new at Camel. I am going to have a file processing with camel but I haven't found a ready solution for my case. I have to process multiple files together in case they exist. These files are uploaded to specific folder with some delays(Example: we have two files A.csv and B.csv, and A.csv is uploaded 10 sec later than B.csv and vice versa). Also if one file is absent more than specific time I need to process only a one file. Could anybody help me with choice a pattern ? As I understand I can use the camel filter to be sure that we already have these two files A.csv and B.csv and only then start processing, but it doesn't resolve my problem.

This is Aggregator EIP.
from("file:inputFolder")
.aggregate(constant(true), AggregationStrategies.groupedExchange())
.completionSize(2) //Wait for two files
.completionTimeout(60000) //Or process single file, if completionSize was not fulfilled within one minute
.to("log:do_something") //Here you can access List<Exchange> from message body
To group messages you can use correlation Expression. For your example (group messages by filename prefix before _) it could be something like this:
private final Expression CORRELATION_EXPRESSION = new Expression() {
#Override
public <T> T evaluate(Exchange exchange, Class<T> type) {
final String fileName = exchange.getIn().getHeader(Exchange.FILE_NAME, String.class);
final String correlationExpression = fileName.substring(0, fileName.indexOf('_'));
return exchange.getContext().getTypeConverter().convertTo(
type,
correlationExpression
);
}
};
And pass it to Aggregator:
from("file:inputDirectory")
.aggregate(CORRELATION_EXPRESSION, AggregationStrategies.groupedExchange())
...
See this gist for full example https://gist.github.com/bedlaj/a2a56aa9291bced8c0a8edebacaf22b0

Related

Emitting the "Side Outputs" and "process output" in single sink with different path

How to emit the "Side Outputs" and "process output" using single sink. Here, in this case, both output needs emit to single sink and based on the tag folder path differs
Eg
OutputTag<String> outputTag = new OutputTag<String>("side-output") {};
SingleOutputStreamOperator<String> mainDataStream = source.process(new ProcessFunction<String, String>() {
#Override
public void processElement(String value, Context ctx, Collector<String> out) {
try {
builder.parse(new InputSource(new StringReader(value)));
out.collect(value);
} catch (SAXException | IOException e) {
ctx.output(outputTag, value);
}
}
});
DataStream<String> sideOutputStream = mainDataStream.getSideOutput(outputTag);
Is there any other better solution? Just worried about performance

If you want to use a single sink, you can add an attribute into your output format and use the attribute to identify the data source in the single sink.
You can also construct two sinks with different parameters to receive data from different sources. In my opinion, without considering the database you use, this kind of multi-thread way has better performance.

Flink's BucketingSink can use a Bucketer to determine which sub-directory inside of the base directory will be used. So you can use this to set the sub-directory based on an attribute in your record being written.
As far as using a single sink, since both the main output and the side output of your function are String objects (same type), you can mainDataStream.union(sideOutputStream) the two streams together before outputting the result.

Camel: how to aggregate files based on exchange in pattern

I have a class to run my route; The input comes from a queue (which is filled by a route that does a query and inserts the rows as messages on the queue)
These messages each contain a few headers:
- pdu_id, basically a prefetch on the filename.
- pad: the path the files reside in
What is to happen: I want the files in the path named by their "pdu_id".* in a tar; After that a REST call is to be done to remove the documents source.
I know a route has a from; but basically I need a route with a dynamic "from", and as below code example shows, queueing froms doesn't do the trick.
The question is what to use instead; I could not find a similar thing, but it can be I didn't use the right google search; in which case I'm deeply sorry.
public class ToDeleteTarAndDeleteRoute extends RouteBuilder {
#Override
public void configure() throws Exception
{
from("broker1:todelete.message_ids.queue")
.from("file:///?fileName=${in.header.pad}${in.header.pdu_id}.*")
.aggregate(new TarAggregationStrategy())
.constant(true)
.completionFromBatchConsumer()
.eagerCheckCompletion()
.to("file:///?fileName=${in.header.pad}${in.header.pdu_id}.tar")
.log("${header.pdu_id} tarred")
.setHeader(Exchange.HTTP_METHOD, constant("DELETE"))
.setHeader("Connection", constant("Close"))
.enrich()
.simple("http:127.0.0.1/restfuldb${header.pdu_id}?httpClient.authenticationPreemptive=true")
.log("${header.pdu_id} tarred and deleted.");
}
}

Yes. Poll enrich can help you in doing it. You should use it something like this:
from("broker1:todelete.message_ids.queue")
.aggregationStrategy(new TarAggregationStrategy())
.pollEnrich()
.simple("file:///?fileName=${in.header.pad}/${in.header.pdu_id}.*")
.unmarshal().string()
.to("file:///?fileName=${in.header.pad}/${in.header.pdu_id}.tar")
.log("${header.pdu_id} tarred")
.setHeader(Exchange.HTTP_METHOD, constant("DELETE"))
.setHeader("Connection", constant("Close"))
.enrich()
.simple("http:127.0.0.1/restfuldb${header.pdu_id}?httpClient.authenticationPreemptive=true")
.log("${header.pdu_id} tarred and deleted.");

Currently the solution to the problem consisted of a few changes based on what #daBigBug answered.
pollEnrich's simple expression uses antInclude rather than fileName;
aggregate is put after pollenrich; as each batch is the set of files, rather than the input from the queue. The input from the queue only provides meta information based on which actions are to be taken.
aggregationStrategy() is not possible in a RouteBuilder; I used aggregate() instead.
I removed the unmarshal(); I don't see why this would be needed; the files can contain binary content.
from("broker1:todelete.message_ids.queue")
.pollEnrich()
.simple("file:${in.header.pad}?antInclude=${in.header.pdu_id}.*")
.aggregate(new TarAggregationStrategy())
.constant(true)
.completionFromBatchConsumer()
.eagerCheckCompletion()
.log("tarring to: ${header.pad}${header.pdu_id}.tar")
.setHeader(Exchange.FILE_NAME, simple("${header.pdu_id}.tar"))
.setHeader(Exchange.FILE_PATH, simple("${header.pad}"))
.to("file://ignored")
...(and the rest of the operations);
I now see the files are getting picked up and even placed in a tar; however, the filename of the tar is unexpected as is the location (it's placed in ./ignored); Also in the rest of the operation, it appears the exchange headers are lost.
If anyone can help figure out how to preserve the headers in a safe way... I'm much obliged. Should I use a new question for that, or should I rephrase the question.

Dynamic Apache Camel Output Route

Hi i want to compute a dynamic output route using apache Camel. I receive a bunch of files in a folder location, based on its contents i want to move the file to dynamic output folder. The name of the ouput folder will be constructed based on the input content of the file. How do i acheive it.
The Following piece of code read the files, processes them, but i am not sure how to set the value of ${foldername} based on the contents of the file
from("file:D:\\camel\\input\\one?recursive=true&delete=true")
.process(new LogProcessor())
.to("file:D:\\camel\\output\\${foldername}")
Please assist

You could create a custom processor to construct the foldername and insert into a header.
public class DirectoryNameProcessor implements Processor {
#Override
public void process(Exchange exchange) {
Message in = exchange.getIn();
// Get the contents of the processed file
String body = in.getBody(String.class);
//Get the original file name
String fileName = in.getHeader("CamelFileName", String.class);
// Perform your logic
in.setHeader("foldername");
}
}
Then in your route you could access the newly created foldername-header:
.to("file:D:\\camel\\output\\${header.foldername}");

The short answer is, you can use the dynamic to endpoint toD.
http://camel.apache.org/message-endpoint.html#MessageEndpoint-DynamicTo
It would look like:
from("file:D:\\camel\\input\\one?recursive=true&delete=true")
.process(new LogProcessor())
.toD("file:D:\\camel\\output\\${foldername}")

Using apache camel csv processor with pollEnrich pattern?

Apache Camel 2.12.1
Is it possible to use the Camel CSV component with a pollEnrich? Every example I see is like:
from("file:somefile.csv").marshal...
Whereas I'm using the pollEnrich, like:
pollEnrich("file:somefile.csv", new CSVAggregator())
So within CSVAggregator I have no csv...I just have a file, which I have to do csv processing myself. So is there a way of hooking up the marshalling to the enrich bit somehow...?
EDIT
To make this more general... eg:
from("direct:start")
.to("http:www.blah")
.enrich("file:someFile.csv", new CSVAggregationStrategy) <--how can I call marshal() on this?
...
public class CSVAggregator implements AggregationStrategy {
#Override
public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
/* Here I have:
oldExchange = results of http blah endpoint
newExchange = the someFile.csv GenericFile object */
}
Is there any way I can avoid this and use marshal().csv sort of call on the route itself?
Thanks,
Mr Tea

You can use any endpoint in enrich. That includes direct endpoints pointing to other routes. Your example...
Replace this:
from("direct:start")
.to("http:www.blah")
.enrich("file:someFile.csv", new CSVAggregationStrategy)
With this:
from("direct:start")
.to("http:www.blah")
.enrich("direct:readSomeFile", new CSVAggregationStrategy);
from("direct:readSomeFile")
.to("file:someFile.csv")
.unmarshal(myDataFormat);

I ran into the same issue and managed to solve it with the following code (note, I'm using the scala dsl). My use case was slightly different, I wanted to load a CSV file and enrich it with data from an additional static CSV file.
from("direct:start") pollEnrich("file:c:/data/inbox?fileName=vipleaderboard.inclusions.csv&noop=true") unmarshal(csv)
from("file:c:/data/inbox?fileName=vipleaderboard.${date:now:yyyyMMdd}.csv") unmarshal(csv) enrich("direct:start", (current:Exchange, myStatic:Exchange) => {
// both exchange in bodies will contain lists instead of the file handles
})
Here the second route is the one which looks for a file in a specific directory. It unmarshals the CSV data from any matching file it finds and enriches it with the direct route defined in the preceding line. That route is pollEnriching with my static file and as I don't define an aggregation strategy it just replaces the contents of the body with the static file data. I can then unmarshal that from CSV and return the data.
The aggregation function in the second route then has access to both files' CSV data as List<List<String>> instead of just a file.

How can Apache Camel be used to monitor file changes?

I would like to monitor all of the files in a given directory for changes, ie an updated timestamp. This use case seems natural for Camel using the file component, but I can't seem to find a way to configure this behavior.
A uri like:
file:/some/directory
will consume the files in the provided directory but will delete them.
A uri like:
file:/some/directory?noop=true
consumes each file once when it is added or when the route is started.
It's surprising that there isn't an option along the lines of
consumeOnChange=true
Is there a straightforward way to monitor file changes and not delete the file after consuming?

You can do this by setting up the idempotentKey to tell Camel how a file is considered changed. For example if the file size changes, or its timestamp changes etc.
See more details at the Camel file documentation at: https://camel.apache.org/components/latest/file-component.html
See the section Avoiding reading the same file more than once (idempotent consumer). And read about idempotent and idempotentKey.
So something alike
from("file:/somedir?noop=true&idempotentKey=${file:name}-${file:size}")
Or
from("file:/somedir?noop=true&idempotentKey=${file:name}-${file:modified}")
You can read here about the various ${file:xxx} tokens you can use: http://camel.apache.org/file-language.html

Setting noop to true will result in Camel setting idempotent=true as well, despite the fact that idempotent is false by default.
Simplest solution to monitor files would be:
.from("file:path?noop=true&idempotent=false&delay=60s")
This will monitor changes to all files in the given directory every one minute.
This can be found in the Camel documentation at: http://camel.apache.org/file2.html.

I don't think Camel supports that specific feature but with the existent options you can come up with a similar solution of monitoring a directory.
What you need to do is set a small delay value to check the directory and maintain a repository of the already read files. Depending on how you configure the repository (by size, by filename, by a mix of them...) this solution would be able to provide you information about news files and modified files. As a caveat it would be consuming the files in the directory very often.
Maybe you could use other solutions different from Camel like Apache Commons VFS2 (I wrote a explanation about how to use it for this scenario: WatchService locks some files?

I faced the same problem i.e. wanted to copy updated files also (along with new files). Below is my configuration,
public static void main(String[] a) throws Exception {
CamelContext cc = new DefaultCamelContext();
cc.addRoutes(createRouteBuilder());
cc.start();
Thread.sleep(10 * 60 * 1000);
cc.stop();
}
protected static RouteBuilder createRouteBuilder() {
return new RouteBuilder() {
public void configure() {
from("file://D:/Production"
+ "?idempotent=true"
+ "&idempotentKey=${file:name}-${file:size}"
+ "&include=.*.log"
+ "&noop=true"
+ "&readLock=changed")
.to("file://D:/LogRepository");
}
};
}
My testing steps:
Run the program and it copies few .log files from D:/Production to D:/LogRepository and then continues to poll D:/Production directory
I opened a already copied log say A.log from D:/Production (since noop=true nothing is moved) and edited it with some editor tool. This doubled the file size and save it.
At this point I think Camel is supposed to copy that particular file again since its size is modified and in my route definition I used "idempotent=true&idempotentKey=${file:name}-${file:size}&readLock=changed". But camel ignores the file.
When I use TRACE for logging it says "Skipping as file is already in progress...", but I did not find any lock file in D:/Production directory when I editted and saved the file.
I also checked that camel still ignores the file if I replace A.log (with same name but bigger size) in D:/Production directory from outside.
But I found, everything is working as expected if I remove noop=true option.
Am I missing something?

If you want monitor file changes in camel, use file-watch component.
Example -> RECURSIVE WATCH ALL EVENTS (FILE CREATION, FILE DELETION, FILE MODIFICATION):
from("file-watch://some-directory")
.log("File event: ${header.CamelFileEventType} occurred on file ${header.CamelFileName} at ${header.CamelFileLastModified}");
You can see the complete documentation here:
Camel file-watch component

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Camel for multiple files processing - apache-camel

Related

Emitting the "Side Outputs" and "process output" in single sink with different path

Camel: how to aggregate files based on exchange in pattern

Dynamic Apache Camel Output Route

Using apache camel csv processor with pollEnrich pattern?

How can Apache Camel be used to monitor file changes?

Categories

Resources