I want to read an FTP file using Apache Camel - apache-camel

I want to read FTP files using Apache Camel. My requirement is to pick up all the files (around 4-5 of them) and process them, but only the files for a specific date. For example, I want to pick up all files created today and leave yesterday's files.
How can I write code to pick files from FTP using Apache Camel with a filter on dates?

You can implement a custom filter and ask Camel to process only the files that satisfy it.
E.g.:
import java.util.Calendar;

import org.apache.camel.component.file.GenericFile;
import org.apache.camel.component.file.GenericFileFilter;

public class DateFilter<T> implements GenericFileFilter<T> {

    @Override
    public boolean accept(GenericFile<T> file) {
        // Midnight at the start of today
        Calendar c = Calendar.getInstance();
        c.set(Calendar.HOUR_OF_DAY, 0);
        c.set(Calendar.MINUTE, 0);
        c.set(Calendar.SECOND, 0);
        c.set(Calendar.MILLISECOND, 0);
        long todayInMillis = c.getTimeInMillis();
        // Accept only files modified today or later
        return file.getLastModified() >= todayInMillis;
    }
}
Define the FileFilter as a bean
<bean id="dateFilter" class="com.something.DateFilter"/>
Use the above filter in your route
from("ftp://someuser#someftpserver.com?password=secret&filter=#dateFilter")
.to("somedestination")
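If you are not using Spring XML to register the bean, a minimal sketch of wiring the same filter up in plain Java might look like the following (assuming Camel 3.x, where the registry exposes bind(...); the file:data/today destination is just a placeholder):
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class FtpDateFilterMain {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        // bind the filter under the id referenced as #dateFilter in the endpoint URI
        context.getRegistry().bind("dateFilter", new DateFilter<>());
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                from("ftp://someuser@someftpserver.com?password=secret&filter=#dateFilter")
                    .to("file:data/today"); // placeholder destination
            }
        });
        context.start();
        Thread.sleep(60_000); // keep the JVM alive long enough for the consumer to poll
        context.stop();
    }
}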

Related

Use of compaction for Parquet bulk format

Since version 1.15 of Apache Flink you can use the compaction feature to merge several files into one.
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#compaction
How can we use compaction with bulk Parquet format?
The existing implementations of RecordWiseFileCompactor.Reader (DecoderBasedReader and InputFormatBasedReader) do not seem suitable for Parquet.
Furthermore, we cannot find any example of compacting Parquet or other bulk formats.
There are two types of file compactors mentioned in Flink's documentation:
OutputStreamBasedFileCompactor : The users can write the compacted results into an output stream. This is useful when the users don’t want to or can’t read records from the input files.
RecordWiseFileCompactor : The compactor can read records one-by-one from the input files and write into the result file similar to the FileWriter.
If I remember correctly, Parquet saves its meta information at the end of the file, so we need to use RecordWiseFileCompactor: we have to read the whole Parquet file to get the meta information at its end, and only then can we use that meta information (number of row groups, schema) to parse the file.
From the Java API, to construct a RecordWiseFileCompactor we need an instance of RecordWiseFileCompactor.Reader.Factory.
There are two implementations of the interface RecordWiseFileCompactor.Reader.Factory: DecoderBasedReader.Factory and InputFormatBasedReader.Factory.
DecoderBasedReader.Factory creates a DecoderBasedReader instance, which reads the whole file content from an InputStream. We would have to load the bytes into a buffer and parse the file from the byte buffer ourselves, which is obviously painful, so we don't use this implementation.
InputFormatBasedReader.Factory creates an InputFormatBasedReader, which reads the whole file content using the FileInputFormat supplier we pass to the InputFormatBasedReader.Factory constructor.
The InputFormatBasedReader instance uses the FileInputFormat to read record by record and passes the records to the writer we gave to the forBulkFormat call, until the end of the file.
The writer receives all the records and compacts them into one file.
So the question becomes: what is FileInputFormat and how do we implement it?
Though the class FileInputFormat has many methods and fields, we know from the InputFormatBasedReader source code that only four of them are called by InputFormatBasedReader:
open(FileInputSplit fileSplit), which opens the file
reachedEnd(), which checks if we hit end of file
nextRecord(), which reads next record from the opened file
close(), which cleans up
Luckily, there is an AvroParquetReader in the package org.apache.parquet.avro that we can utilize. It already implements open/read/close, so we can wrap the reader inside a FileInputFormat and let the AvroParquetReader do all the dirty work.
Here's an example code snippet:
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

import java.io.IOException;

public class ExampleFileInputFormat extends FileInputFormat<GenericRecord> {

    private ParquetReader<GenericRecord> parquetReader;
    private GenericRecord readRecord;

    @Override
    public void open(FileInputSplit split) throws IOException {
        Configuration config = new Configuration();
        // set hadoop config here
        // for example, if you are using gcs, set fs.gs.impl here
        // I haven't tried to use core-site.xml, but I believe this is feasible
        InputFile inputFile = HadoopInputFile.fromPath(new org.apache.hadoop.fs.Path(split.getPath().toUri()), config);
        parquetReader = AvroParquetReader.<GenericRecord>builder(inputFile).build();
        readRecord = parquetReader.read();
    }

    @Override
    public void close() throws IOException {
        parquetReader.close();
    }

    @Override
    public boolean reachedEnd() throws IOException {
        return readRecord == null;
    }

    @Override
    public GenericRecord nextRecord(GenericRecord genericRecord) throws IOException {
        // return the record read in the previous call and read one record ahead
        GenericRecord r = readRecord;
        readRecord = parquetReader.read();
        return r;
    }
}
Then you can use the ExampleFileInputFormat like below
FileSink<GenericRecord> sink = FileSink.forBulkFormat(
        new Path(path),
        AvroParquetWriters.forGenericRecord(schema))
    .withRollingPolicy(OnCheckpointRollingPolicy.build())
    .enableCompact(
        FileCompactStrategy.Builder.newBuilder()
            .enableCompactionOnCheckpoint(10)
            .build(),
        new RecordWiseFileCompactor<>(
            new InputFormatBasedReader.Factory<>(new SerializableSupplierWithException<FileInputFormat<GenericRecord>, IOException>() {
                @Override
                public FileInputFormat<GenericRecord> get() throws IOException {
                    return new ExampleFileInputFormat();
                }
            })
        ))
    .build();
I have successfully deployed this to Flink on k8s and compacted files on GCS. There are some notes for deploying:
You need to download the Flink shaded Hadoop jar from https://flink.apache.org/downloads.html (search for "Pre-bundled Hadoop" on the page) and put the jar into $FLINK_HOME/lib/.
If you are writing files to some object storage, for example GCS, you need to follow the plugin instructions. Remember to put the plugin jar into the plugins folder, not the lib folder.
If you are writing files to some object storage, you also need to download the connector jar from the cloud service provider. For example, I'm using GCS and downloaded the gcs-connector jar following the GCP instructions. Put the jar into some folder other than $FLINK_HOME/lib or $FLINK_HOME/plugins; I put the connector jar into a newly created folder $FLINK_HOME/hadoop-lib.
Set the environment variable HADOOP_CLASSPATH=$FLINK_HOME/lib/YOUR_SHADED_HADOOP_JAR:$FLINK_HOME/hadoop-lib/YOUR_CONNECTOR_JAR
After all these steps, you can start your job and you are good to go.

How to write files dynamically into subdirectories with FileWritingMessageHandler

I have to write files into multiple subdirectories based on a header attribute, but I can't find a way to configure this in Spring Integration.
@Bean
@ServiceActivator(inputChannel = "processingChannel")
public MessageHandler processingDirectory() {
    FileWritingMessageHandler handler = new FileWritingMessageHandler(new File("some-path"));
    handler.setFileExistsMode(FileExistsMode.REPLACE);
    handler.setExpectReply(false);
    handler.setPreserveTimestamp(true);
    handler.setTemporaryFileSuffix(".writing");
    handler.setAutoCreateDirectory(true);
    return handler;
}
This bean receives a file along with some header attributes, e.g. type="abc", from "processingChannel". Files are written successfully into some-path, but my requirement is to write into some-path/abc or some-path/xyz based on the "type" value.
Use a SpEL expression:
new FileWritingMessageHandler(new SpelExpressionParser().parseExpression(
        "headers['someHeaderWithTheDestinationPath']"));

Camel for multiple files processing

I am new to Camel. I am going to build file processing with Camel, but I haven't found a ready-made solution for my case. I have to process multiple files together when they both exist. These files are uploaded to a specific folder with some delay (example: we have two files, A.csv and B.csv, and A.csv is uploaded 10 seconds later than B.csv, or vice versa). Also, if one file is absent for longer than a specific time, I need to process only the one file that is there. Could anybody help me choose a pattern? As I understand it, I can use a Camel filter to be sure that we already have both A.csv and B.csv and only then start processing, but that doesn't solve my problem.
This is the Aggregator EIP.
from("file:inputFolder")
.aggregate(constant(true), AggregationStrategies.groupedExchange())
.completionSize(2) //Wait for two files
.completionTimeout(60000) //Or process single file, if completionSize was not fulfilled within one minute
.to("log:do_something") //Here you can access List<Exchange> from message body
To group messages you can use a correlation Expression. For your example (grouping messages by the filename prefix before _) it could be something like this:
private final Expression CORRELATION_EXPRESSION = new Expression() {
    @Override
    public <T> T evaluate(Exchange exchange, Class<T> type) {
        final String fileName = exchange.getIn().getHeader(Exchange.FILE_NAME, String.class);
        final String correlationExpression = fileName.substring(0, fileName.indexOf('_'));
        return exchange.getContext().getTypeConverter().convertTo(type, correlationExpression);
    }
};
And pass it to Aggregator:
from("file:inputDirectory")
.aggregate(CORRELATION_EXPRESSION, AggregationStrategies.groupedExchange())
...
See this gist for a full example: https://gist.github.com/bedlaj/a2a56aa9291bced8c0a8edebacaf22b0

Using the Apache Camel CSV processor with the pollEnrich pattern?

Apache Camel 2.12.1
Is it possible to use the Camel CSV component with a pollEnrich? Every example I see is like:
from("file:somefile.csv").marshal...
Whereas I'm using the pollEnrich, like:
pollEnrich("file:somefile.csv", new CSVAggregator())
So within CSVAggregator I have no CSV... I just have a file, and I have to do the CSV processing myself. So is there a way of hooking up the marshalling to the enrich bit somehow?
EDIT
To make this more general... e.g.:
from("direct:start")
.to("http:www.blah")
.enrich("file:someFile.csv", new CSVAggregationStrategy) <--how can I call marshal() on this?
...
public class CSVAggregator implements AggregationStrategy {
    @Override
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        /* Here I have:
           oldExchange = results of http blah endpoint
           newExchange = the someFile.csv GenericFile object */
    }
}
Is there any way I can avoid this and use marshal().csv sort of call on the route itself?
Thanks,
Mr Tea
You can use any endpoint in enrich. That includes direct endpoints pointing to other routes. Your example...
Replace this:
from("direct:start")
.to("http:www.blah")
.enrich("file:someFile.csv", new CSVAggregationStrategy)
With this:
from("direct:start")
.to("http:www.blah")
.enrich("direct:readSomeFile", new CSVAggregationStrategy);
from("direct:readSomeFile")
.to("file:someFile.csv")
.unmarshal(myDataFormat);
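For illustration, a minimal sketch of what a CSVAggregationStrategy could look like in this setup (the class name comes from the question; the "csvData" header name is arbitrary, and the org.apache.camel.processor.aggregate package matches Camel 2.x, as in the question). With camel-csv, the unmarshalled body is a List of rows, so the strategy simply attaches it to the original exchange:
import java.util.List;

import org.apache.camel.Exchange;
import org.apache.camel.processor.aggregate.AggregationStrategy;

public class CSVAggregationStrategy implements AggregationStrategy {

    @Override
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        // newExchange carries the rows produced by the unmarshal step in direct:readSomeFile
        List<?> csvRows = newExchange.getIn().getBody(List.class);
        // keep the HTTP response as the body and expose the CSV rows via a header
        oldExchange.getIn().setHeader("csvData", csvRows);
        return oldExchange;
    }
}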
I ran into the same issue and managed to solve it with the following code (note: I'm using the Scala DSL). My use case was slightly different: I wanted to load a CSV file and enrich it with data from an additional static CSV file.
from("direct:start") pollEnrich("file:c:/data/inbox?fileName=vipleaderboard.inclusions.csv&noop=true") unmarshal(csv)
from("file:c:/data/inbox?fileName=vipleaderboard.${date:now:yyyyMMdd}.csv") unmarshal(csv) enrich("direct:start", (current:Exchange, myStatic:Exchange) => {
// both exchange in bodies will contain lists instead of the file handles
})
Here the second route is the one which looks for a file in a specific directory. It unmarshals the CSV data from any matching file it finds and enriches it with the direct route defined in the preceding line. That route is pollEnriching with my static file and as I don't define an aggregation strategy it just replaces the contents of the body with the static file data. I can then unmarshal that from CSV and return the data.
The aggregation function in the second route then has access to both files' CSV data as List<List<String>> instead of just a file.

Copy a related file with Apache Camel

First of all: I'm a Camel newbie :-)
I want to watch a directory for XML files, then move each XML file to another directory, move the PDF file with the same filename (but a different extension) to the same directory, and then do some Java stuff.
What is the best way to move that PDF file?
This is the route that I currently have:
from("file://C:/temp/camel/in?delete=true").filter(new Predicate() {
#Override
public boolean matches(final Exchange exchange) {
String filename = (String) exchange.getIn().getHeader("CamelFileRelativePath");
return "xml".equals(FilenameUtils.getExtension(filename));
}
})
.to("file://C:/temp/camel/out").bean(ServiceBean.class, "callWebservice")
Thanks!
You can achieve that without using a filter, by using a regular expression that includes only the two extensions, .xml and .pdf:
from("file://C:/temp/camel/in?delete=true&?include=.*.xml|.*.zip")
.to("file://C:/temp/camel/out").bean(ServiceBean.class, "callWebservice");
If you use the filter() EIP, the files you are not interested in are still consumed and deleted (because of delete=true), which might not be what you want; with the include option they are simply left in the directory.
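If the Java step should only run for the XML files (as the question implies), a minimal sketch on top of the route above could branch on the file extension (assuming the Simple file expression ${file:ext} is available for file consumers):
from("file://C:/temp/camel/in?delete=true&include=.*.xml|.*.pdf")
    .to("file://C:/temp/camel/out")
    .choice()
        .when(simple("${file:ext} == 'xml'"))
            .bean(ServiceBean.class, "callWebservice") // only call the service for XML files
    .end();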
