camel unpacking tar.gzip files - apache-camel

After downloading several files with Camel over FTP, I need to process them, but they are in tar.gz format. Camel supports gzip and, as far as I can see, also tar from 2.16.0 onwards (http://camel.apache.org/camel-2160-release.html).
The code I have for extracting the gzip:
from("file:modelFiles?readLock=changed&recursive=true&consumer.delay=1000")
.unmarshal(new ZipFileDataFormat())
.choice()
.when(body().isNotNull())
.log("Uziping file ${file:name}.")
.to("file:modelFiles_unzipped")
.endChoice()
.end();
All the files run through the route, but they are written out as .tar.gz again, and worse, the content becomes corrupted, so they cannot even be opened as gzip archives afterwards.
Questions:
How should I unpack the gzip archives?
How could I do the same for the tar files?
Update 1:
Thanks for the post, Jeremie. I changed the code as proposed:
from("file:modelFilesSBML2?readLock=changed&recursive=true&consumer.delay=1000")
.unmarshal().gzip()
.split(new TarSplitter())
.to("file:modelFilesSBML_unzipped");
Then I receive the following exception (just for info, the tar.gz files are not zero-length): FailedException: Cannot write null body to file: modelFilesSBML_unzipped\2006-01-31\BioModels_Database-r4-sbml_files.tar.gz:
2016-03-22 14:11:47,950 [ERROR|org.apache.camel.processor.DefaultErrorHandler|MarkerIgnoringBase] Failed delivery for (MessageId: ID-JOY-49807-1458652278822-0-592 on ExchangeId: ID-JOY-49807-1458652278822-0-591). Exhausted after delivery attempt: 1 caught: org.apache.camel.component.file.GenericFileOperationFailedException: Cannot write null body to file: modelFilesSBML_unzipped\2006-01-31\BioModels_Database-r4-sbml_files.tar.gz
Solution:
After trying many approaches, I finally got it working as follows (with Camel 2.17.0; it did not work with 2.16.0 or 2.16.1):
from("file:modelFilesSBML?noop=true&recursive=true&consumer.delay=1000" )
.unmarshal().gzip()
.split(new TarSplitter())
.to("log:tar.gzip?level=INFO&showHeaders=true")
.choice()
.when(body().isNotNull())
.log("### Extracting file: ${file:name}.")
.to("file:modelFilesSBML_unzipped?fileName=${in.header.CamelFileRelativePath}_${file:name}")
.endChoice()
.end();
With Camel 2.17.0 you can also skip the body().isNotNull() check.
Jeremie's proposal helped a lot, so I will accept his answer as the solution. Nevertheless, the exception would still occur if I did not check the message body for null. The fileName=${in.header.CamelFileRelativePath}_${file:name} option also keeps the original file structure, in that each extracted file name is prefixed with the path of its .tar.gz, but I have not found any other way to preserve the directory structure, as the file endpoint does not accept expressions for the directory part in "file:directory?options...".
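For completeness, on Camel 2.17.0 the same route can be reduced like this (a sketch using the exact endpoints from above, just without the null-body check):
from("file:modelFilesSBML?noop=true&recursive=true&consumer.delay=1000")
    .unmarshal().gzip()
    .split(new TarSplitter())
        .log("### Extracting file: ${file:name}.")
        .to("file:modelFilesSBML_unzipped?fileName=${in.header.CamelFileRelativePath}_${file:name}");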

You can use the camel-tarfile component.
If your tar.gz contains multiple files, you should ungzip it, then untar it and split the exchange for each file. The TarSplitter is an expression which splits a tar into an iterator over the files contained in the tar.
from("file:target/from")
.unmarshal().gzip()
.split(new TarSplitter())
.to("file:target/to");

Related

Apache Camel doneFileName with changing name

I'm currently creating some routes, and for one of them I have a problem.
Usually I have a data file and then a done file which has the same name prefixed by "ACK", and this works perfectly with Camel and the doneFileName option.
But for one of my routes I have to work with a different situation: I still receive two files, but they have the same naming pattern, something like MyFILE-{{timestamp}}. The data file contains the data, and the done file contains just "done".
So I need something to check the content of the file, and if it's just "done", then process the other file.
Is there a way to handle this with camel?
The most pragmatic solution I see is to write an "adapter script" (bash or whatever you have at your disposal) that peeks into every file with a timestamp in its name.
If the file content is "done":
Look up the other "MyFILE-{{timestamp}}" (the data file) and rename it to "MyFILE"
Rename the done file to "MyFILE.done"
Camel can then import the data file using the standard done-file option. Because both files are renamed to something without a timestamp, the peek script ignores them after renaming.
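Once the script has renamed the pair, a consumer along these lines picks up the data file only when its done file is present (a sketch; the endpoint names and the move target are assumptions):
from("file:inbox?fileName=MyFILE&doneFileName=${file:name}.done&move=.processed")
    // the body is the data file; Camel removes the done file once the exchange is consumed
    .to("direct:handleData");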

Camel - Why is the file consumer move option working differently with pollEnrich?

When I use a file consumer
<from uri="file:in?move=$simple{file:name}-transfered&include=^demo_keys\.ks$&sortBy=file:name" />
the file(s) are renamed to xxx-transfered after processing (as I expected and as stated in the doc).
But when I use the same with pollEnrich (for just one file)
<pollEnrich>
<simple>file://in?fileName=demo_keys.ks2&move=${camelId}-uploaded&sendEmptyMessageWhenIdle=true&maxMessagesPerPoll=1&delay=8000</simple>
</pollEnrich>
the file is not renamed after processing; instead it is moved into a newly created sub-directory with its original name.
How can I rename the file processed by pollEnrich, i.e. achieve the same behaviour as with a normal file consumer?
I've tested it with v2.17.2 and v2.18.0
Thanks!
I think this may be a bug in Camel; try using a file language expression (e.g. ${file:name}) instead of ${camelId} just to be sure, but the official documentation is pretty clear in this case: the value should be interpreted as a file name, not a directory.
I guess you should report a bug in Camel's JIRA.
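As a quick, hypothetical way to test that suggestion in the Java DSL, using the plain endpoint-URI form of pollEnrich so the move expression is evaluated by the file component itself (endpoints are placeholders, not the poster's actual route):
from("direct:trigger")
    // same options as above, but move uses a file language expression instead of ${camelId}
    .pollEnrich("file://in?fileName=demo_keys.ks2&move=${file:name}-uploaded&maxMessagesPerPoll=1", 8000)
    .to("mock:result");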

Purpose of fs.hdfs.hadoopconf in flink-conf.yaml

Newbie to Flink.
I am able to run the example wordcount.jar on a file present in a remote HDFS cluster without declaring the fs.hdfs.hadoopconf variable in the Flink conf.
So I am wondering what exactly the purpose of the above-mentioned variable is.
Does declaring it change the way one runs the example jar?
Command :
flink-cluster.vm ~]$ /opt/flink/bin/flink run /opt/flink/examples/batch/WordCount.jar --input hdfs://hadoop-master:9000/tmp/test-events
Output:
.......
07/13/2016 00:50:13 Job execution switched to status FINISHED.
(foo,1)
.....
(bar,1)
(one,1)
Setup :
Remote HDFS cluster on hdfs://hadoop-master.vm:9000
Flink cluster running on flink-cluster.vm
Thanks
Update :
As pointed out by Serhiy, I declared fs.hdfs.hadoopconf in the conf, but on running the job with the updated argument hdfs:///tmp/test-events.1468374669125 I got the following error:
flink-conf.yaml
# You can also directly specify the paths to hdfs-default.xml and hdfs-site.xml
# via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'.
#
fs.hdfs.hadoopconf: hdfs://hadoop-master:9000/
fs.hdfs.hdfsdefault : hdfs://hadoop-master:9000/
Command :
flink-cluster.vm ~]$ /opt/flink/bin/flink run /opt/flink/examples/batch/WordCount.jar --input hdfs:///tmp/test-events
Output :
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: The given HDFS file URI (hdfs:///tmp/test-events.1468374669125) did not describe the HDFS NameNode. The attempt to use a default HDFS configuration, as specified in the 'fs.hdfs.hdfsdefault' or 'fs.hdfs.hdfssite' config parameter failed due to the following problem: Either no default file system was registered, or the provided configuration contains no valid authority component (fs.default.name or fs.defaultFS) describing the (hdfs namenode) host and port.
at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:172)
at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:679)
at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1026)
... 19 more
From the documentation:
fs.hdfs.hadoopconf: The absolute path to the Hadoop File System’s
(HDFS) configuration directory (OPTIONAL VALUE). Specifying this value
allows programs to reference HDFS files using short URIs
(hdfs:///path/to/files, without including the address and port of the
NameNode in the file URI). Without this option, HDFS files can be
accessed, but require fully qualified URIs like
hdfs://address:port/path/to/files. This option also causes file
writers to pick up the HDFS’s default values for block sizes and
replication factors. Flink will look for the “core-site.xml” and
“hdfs-site.xml” files in the specified directory.
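In other words, fs.hdfs.hadoopconf should point to a local directory that contains core-site.xml and hdfs-site.xml, not to an hdfs:// URI. A sketch, assuming a typical (hypothetical) Hadoop installation path:
fs.hdfs.hadoopconf: /etc/hadoop/conf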

How to process only the last file in a directory using Apache Camel's file component

I have a directory with files likes this:
inbox/
data.20130813T1921.json
data.20130818T0123.json
data.20130901T1342.json
I'm using Apache Camel 2.11 and on process start, I only want to process one file: the latest. The other files can actually be ignored. Alternatively, the older files can be deleted once a new file has been processed.
I'm configuring my component using the following, but it obviously doesn't do what I need:
file:inbox/?noop=true
noop does keep the last file, but also all other files. On startup, Camel processes all existing files, which is more than I need.
What is the best way to only process the latest file?
You can use sorting and sort by name, and you possibly need to reverse it so the latest file is first (or last); try it out to see which one you need. Then set maxMessagesPerPoll=1 to only pick up one file. And you need to set eagerMaxMessagesPerPoll=false to allow sorting before limiting the number of files.
You can find details at: http://camel.apache.org/file2. See the section Sorting using sortBy for the sorting.
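A rough endpoint sketch of that idea (whether you need the reverse: prefix depends on how your file names sort, as noted above):
file:inbox?noop=true&sortBy=reverse:file:name&maxMessagesPerPoll=1&eagerMaxMessagesPerPoll=false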
An alternative would be to still use sorting to ensure the latest file is last. Then you can use the Aggregator EIP to aggregate all the files, with org.apache.camel.processor.aggregate.UseLatestAggregationStrategy as the aggregation strategy, so that only the last (i.e. the latest) file is kept. You can then instruct the file endpoint with delete=true to delete the files when done. You would also need to configure the aggregator with completionFromBatchConsumer=true; see the sketch after the link below.
The Aggregator EIP is documented here: http://camel.apache.org/aggregator2
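A minimal Java DSL sketch of that alternative (the inbox path and the direct endpoint are placeholders):
// uses org.apache.camel.processor.aggregate.UseLatestAggregationStrategy
from("file:inbox?delete=true&sortBy=file:name")
    // correlate every file from the poll into one aggregation group
    .aggregate(constant(true), new UseLatestAggregationStrategy())
        // complete the aggregation when the batch from the file consumer is done
        .completionFromBatchConsumer()
    // only the last (latest) file of the batch reaches this point
    .to("direct:processLatest");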

Apache camel multiple file processing with exec

I am having trouble fixing this simple route; I get an exception right after exec. It seems like exec is acting as a producer and overwriting the file.
Exception:
org.apache.camel.component.file.GenericFileOperationFailedException: Cannot store file: C:\camel_tests\stage\Downloads.rar
Route:
The home directory will contain a rar file with images that should be extracted with WinRAR.exe; each file in the rar is then processed and eventually moved to the arch directory once this route is done. The last stage that succeeds is extracting the files into the stage directory.
Here CMD_EXPLODE = "\"C:/Program Files/WinRAR/WinRAR.exe\"";
from("file://C:/camel_tests/home?fileName=Downloads.rar&preMove=//C:/camel_tests/stage")
.to("exec:"+Consts.CMD_EXPLODE+"?args=e Downloads.rar&workingDir=C:/camel_tests/stage&outFile=decompress_output.txt")
.to("file://C:/camel_tests/stage?exclude=.*.rar")
.process(new PrintFiles())
.to("file://C:/camel_tests/stage?fileName=Downloads.rar&move=//C:/camel_tests/arch").end();
You should split that into 2 routes. The first does the from -> exec.
The second does from -> process -> to.
The second route will then process each of the files extracted by WinRAR.
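A rough sketch of that split, reusing the endpoints from the route above (untested; the destination of the second route is an assumption):
// route 1: pick up the rar, pre-move it into the stage directory and let WinRAR extract it
from("file://C:/camel_tests/home?fileName=Downloads.rar&preMove=//C:/camel_tests/stage")
    .to("exec:" + Consts.CMD_EXPLODE + "?args=e Downloads.rar&workingDir=C:/camel_tests/stage&outFile=decompress_output.txt");

// route 2: consume the extracted files (skipping the rar), process them and write them to arch
from("file://C:/camel_tests/stage?exclude=.*.rar")
    .process(new PrintFiles())
    .to("file://C:/camel_tests/arch");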
