sFTP with multiple consumers and idempotent routes: file exceptions - apache-camel

I have a couple of Camel routes that process files located on an sFTP server. The application runs on multiple nodes, so I have added file locking to ensure that only a single node processes each file.
The route URI looks like this (split over multiple lines for readability):
sftp://<user>@<server>:<port>/<folder>
?password=<password>
&readLock=changed
&readLockMinAge=10s
&delete=true
&delay=60s
&include=<file mask>
The route looks like this:
from(inUri) //
.id("myId") //
.idempotentConsumer(header(Exchange.FILE_NAME), messageIdRepository) //
.eager(true) //
.skipDuplicate(true) //
.log(LoggingLevel.INFO, "Processing file: ${file:name}") //
// Save original file in archive directory.
.to(archiveUri) //
... do other stuff...
Every now and then, I'm getting what looks like file contention warning messages:
Error processing file RemoteFile[<file>] due to Cannot retrieve file: <file>. Caused by:
[org.apache.camel.component.file.GenericFileOperationFailedException - Cannot retrieve file: <file>]
org.apache.camel.component.file.GenericFileOperationFailedException: Cannot retrieve file: <file>
Caused by: com.jcraft.jsch.SftpException: No such file
... and also these:
Error during commit. Caused by: [org.apache.camel.component.file.GenericFileOperationFailedException - Cannot delete file: <file>]
org.apache.camel.component.file.GenericFileOperationFailedException: Cannot delete file: <file>
Caused by: com.jcraft.jsch.SftpException: No such file
Have I missed anything in my setup?
I've tried adding the idempotent repository to the input URI, like this:
sftp://<user>@<server>:<port>/<folder>
?password=<password>
&readLock=changed
&readLockMinAge=<minAge>
&delete=true
&delay=<delay>
&include=<file mask>
&idempotent=true
&idempotentKey=$simple{file:name}
&idempotentRepository=#messageIdRepository
... but am getting the same kind of errors.
I'm using Apache Camel version 2.24.2.

File locking across multiple nodes is (in my experience) not very reliable. When the clients are distributed, a distributed file lock must be supported, and I don't know what the FTP layer adds to this mix, but I doubt it makes things easier.
Normally, the errors you get are "just" from a node that tries to process an already processed (and therefore deleted) file. So they are not a real problem, and if you can live with them, your setup probably works correctly (files are not actually processed multiple times).
What I usually do is:
Use the initialDelay and delay options to avoid all nodes polling at the same time
Use the option shuffle=true so that every client shuffles the file list randomly, which lowers the chance that all clients try to process the same file
These measures are not reliable either; they can at most be an improvement.
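Applied to the asker's input URI, those two suggestions might look like this (shuffle and initialDelay are documented options of the file/FTP consumers; initialDelay is in milliseconds, and the values here are only examples):

```
sftp://<user>@<server>:<port>/<folder>
?password=<password>
&readLock=changed
&readLockMinAge=10s
&delete=true
&initialDelay=15000
&delay=60s
&shuffle=true
&include=<file mask>
```

With a different initialDelay per node, the polls are staggered; shuffle then randomizes the order in which each node walks the remaining files.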
The only way to reliably avoid processing the same file on multiple nodes is probably to use just one consumer. We use this setup in combination with health checks and automatic restarts to make sure the single consumer is available.
Notice that adding an idempotent repository is worth nothing unless you use a distributed repository such as Hazelcast. If you only use a local (or in-memory) repository, it is individual per node and therefore only prevents the same node from processing the same file multiple times.
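The per-node limitation can be sketched in plain Java (no Camel involved; the class and file names are purely illustrative). Each "node" owns its own in-memory set, so both accept the same file name, while a single shared set would reject the duplicate:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustration of why a per-node in-memory idempotent repository cannot
// deduplicate across nodes: each node has its own key set, so both
// "nodes" accept the same file name.
public class PerNodeRepoDemo {

    // Simplified stand-in for an in-memory idempotent repository:
    // add() returns true only the first time a key is seen.
    static class InMemoryRepo {
        private final Set<String> seen = ConcurrentHashMap.newKeySet();
        boolean add(String key) { return seen.add(key); }
    }

    public static void main(String[] args) {
        InMemoryRepo node1 = new InMemoryRepo(); // repository local to node 1
        InMemoryRepo node2 = new InMemoryRepo(); // repository local to node 2

        // Both nodes poll the same directory and see the same file.
        boolean node1Processes = node1.add("orders.csv");
        boolean node2Processes = node2.add("orders.csv");

        // Both local repositories accept the file -> duplicate processing.
        System.out.println(node1Processes && node2Processes); // prints "true"

        // A shared (distributed) repository rejects the second attempt.
        InMemoryRepo shared = new InMemoryRepo();
        System.out.println(shared.add("orders.csv")); // prints "true"
        System.out.println(shared.add("orders.csv")); // prints "false"
    }
}
```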
Update due to comment
If you have a distributed idempotent repository and files are still processed multiple times, the repository does not work. I can only guess at possible causes:
The repository does nothing (check it for entries)
You have to set readLock to idempotent to "enable" the repository for the file component
For whatever reason the same file gets different idempotency keys on different nodes (check your repository), and therefore the files are not recognized as duplicates
Unfortunately, the idempotent setting of readLock is only supported by the file component, not the FTP component.
There is also an inProgressRepository option. You could also try attaching your repository with this option.
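Attaching the repository as the in-progress repository might look like this (a sketch only; inProgressRepository is a documented file/FTP option, and #messageIdRepository is the asker's existing repository bean):

```
sftp://<user>@<server>:<port>/<folder>
?password=<password>
&readLock=changed
&readLockMinAge=10s
&delete=true
&delay=60s
&include=<file mask>
&inProgressRepository=#messageIdRepository
```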

Related

Using camel-smb SMB picks up (large) files while still being written to

When trying to create cyclic moving of files I encountered strange behavior with readLock. Create a large file (some 100 MBs) and transfer it using SMB from the out to the in folder.
FROM:
smb2://smbuser:****@localhost:4455/user/out?antInclude=FILENAME*&consumer.bridgeErrorHandler=true&delay=10000&inProgressRepository=%23inProgressRepository&readLock=changed&readLockMinLength=1&readLockCheckInterval=1000&readLockTimeout=5000&streamDownload=true&username=smbuser&delete=true
TO:
smb2://smbuser:****@localhost:4455/user/in?username=smbuser
Create another flow to move the file back from the IN to the OUT folder. After some transfers, the file will be picked up while still being written to by another route, and a transfer will be done with a much smaller file, resulting in a partial file at the destination.
FROM:
smb2://smbuser:****@localhost:4455/user/in?antInclude=FILENAME*&delete=true&readLock=changed&readLockMinLength=1&readLockCheckInterval=1000&readLockTimeout=5000&streamDownload=false&delay=10000
TO:
smb2://smbuser:****@localhost:4455/user/out
Question is: why is my readLock not working properly? (P.S. streamDownload is required.)
UPDATE: it turns out this only happens on a Windows Samba share, and with streamDownload=true. So, something with stream chunking. Any advice welcome.
The solution is to prevent the polling strategy from automatically picking up a file, and to make it aware of the read lock (in progress) on the other side. So I lowered delay to 5 seconds and, in the FROM part on both sides, added readLockMinAge=5s, which inspects the file modification time.
Since the stream updates the file every second, this is enough to hold off the read lock.
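Put together, the FROM endpoint on the out side might look like this (a sketch of the described fix, not a verified configuration; delay is in milliseconds, and readLockTimeout is raised because Camel requires the timeout to exceed readLockMinAge):

```
smb2://smbuser:****@localhost:4455/user/out?antInclude=FILENAME*&delay=5000&readLock=changed&readLockMinLength=1&readLockMinAge=5s&readLockCheckInterval=1000&readLockTimeout=10000&streamDownload=true&username=smbuser&delete=true
```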
An explanation of why the previously mentioned situation happens:
When a route prepares to pick up from the out folder, a large file (1 GB) is being transferred chunk by chunk to the in folder. At the end of the streaming, the file is marked for removal by camel-smbj and receives the status STATUS_DELETE_PENDING.
Now another part of this process starts to send a newly arrived file to the out folder and finds that this file already exists. Because of the default fileExist=Override strategy, it tries to delete (and afterwards store) the existing file (which is still not deleted from the previous step) and receives an exception, which causes some InputStream chunks to be lost.

Camel File Consumer - leave file after processing but accept files with same name

So this is the situation:
I have a workflow, that waits for files in a folder, processes them and then sends them to another system.
For different reasons we use an ActiveMQ broker between "sub-processes" in the workflow, where each route alters the message in some way before it is sent in the last step. Each "sub-process" only reads and writes to/from ActiveMQ, except the first and last routes.
It is also part of the workflow that there is a route after sending the message that takes care of the initial file, moving or deleting it. Only this route knows what to do with the file.
This means that the file has to stay in the folder after the consumer route has finished, because only the meta-data is written to ActiveMQ; the actual workflow is not done yet.
I got this to work using the noop=true parameter on the file consumer.
The problem with this is that after the "After Sending Route" deletes (or moves) the file, the file consumer will not react to new files with the same name until I restart the route.
It is clear that this is the expected and correct behavior, because it is the point of the noop parameter to ignore a file that was consumed before, but this doesn't help me.
The question is now: how do I get the file consumer to process a file only once as long as it is present in the folder, but "forget" about it as soon as some other process (in this case a different route) removes it?
As an alternative, I could let the file component move the file into a temp folder, from where it gets processed later, leaving the consuming folder empty. But this introduces new problems that I'd like to avoid (e.g. a file with the same name being moved into the folder while the first one is not yet fully processed).
I'd love to hear some ideas on how to handle that case.
Greets Chris
You need to tell Camel not to use only the filename for idempotency checking.
In a similar situation, where I wanted to pick up changes to a file that was otherwise no-oped, I have the option
idempotentKey=${file:name}-${file:modified}
in my url, which ensures if you change the file, or a new file is created, it treats that as a different file and processes it.
Do be careful to check how many files you might be processing, because the idempotent cache is limited by default (to 1000 records, I think). So if you were processing more than 1000 files at a time, it might "forget" that it has already processed file 1 when file 1001 arrives, and try to reprocess file 1 again.
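Both points above can be sketched in plain Java (no Camel; the class names and the cache size of 3 are illustrative only): a key built from name plus modification time changes when the file changes, and a size-bounded LRU cache eventually "forgets" old entries:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrates an idempotent key built from name + modification time,
// and a size-limited cache that can "forget" old entries.
public class IdempotentKeyDemo {

    // Equivalent in spirit to idempotentKey=${file:name}-${file:modified}
    static String key(String name, long lastModified) {
        return name + "-" + lastModified;
    }

    // A size-limited LRU cache, analogous to a bounded idempotent cache
    // (here capped at a handful of entries instead of ~1000).
    static class BoundedCache extends LinkedHashMap<String, Boolean> {
        private final int max;
        BoundedCache(int max) { super(16, 0.75f, false); this.max = max; }
        @Override protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
            return size() > max; // evict the oldest entry once over capacity
        }
        // Returns true if the key was not already present.
        boolean addIfAbsent(String k) { return put(k, Boolean.TRUE) == null; }
    }

    public static void main(String[] args) {
        // Same file name, touched later -> different key -> reprocessed.
        System.out.println(key("data.txt", 1000L).equals(key("data.txt", 2000L))); // prints "false"

        BoundedCache cache = new BoundedCache(3);
        cache.addIfAbsent("file1");
        cache.addIfAbsent("file2");
        cache.addIfAbsent("file3");
        cache.addIfAbsent("file4"); // evicts "file1"
        System.out.println(cache.addIfAbsent("file1")); // prints "true": cache forgot file1
    }
}
```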

How do I introduce a Delay in Camel to prevent file locking before the files are copied in?

I am using Camel, ActiveMQ, and JMS to poll a directory and process any files it finds. The problem with larger files is that they start processing before being fully copied into the directory. I had assumed (yes, I know what assuming gets you) that the file system would prevent it, but that doesn't seem to be true. The examples in the Camel docs do not seem to be working. Here is my code from within the configure method of the RouteBuilder:
from("file://" + env.getProperty("integration.directory.scan.add.eng.jobslist")+"?consumer.initialDelay=100000")
.doTry()
.setProperty("servicePath").constant("/job")
.setProperty("serviceMethod").constant("POST")
.process("engImportJobsFromFileProcessor")
.doCatch(Exception.class)
.to("log:-- Add Job(s) Error -------------------------")
.choice()
.when(constant(env.getProperty("eng.mail.enabled.flag.add.jobslist.yn")).isEqualToIgnoreCase("Y"))
.setHeader("subject", constant(env.getProperty("integration.mq.topic.add.eng.jobslist.error.email.subject")))
.to("direct://email.eng")
.otherwise()
.to("log:-----------------------------------------")
.to("log:-- Email for JOBSLIST IS DISABLED")
.to("log:-----------------------------------------")
.end()
.end()
.log("Finished loading jobs from file ")
;
As you can see, I tried to set an initialDelay; I have also tried delay and readLock=changed, and nothing made a difference. As soon as the file hits the directory, Camel starts processing. All I am after is a nice simple delay before the file is polled. Any ideas?
Use option readLockMinAge.
From File2 component documentation:
This option allows you to specify a minimum age a file must be before attempting to acquire the read lock. For example, use readLockMinAge=300s to require that the file is at least 5 minutes old.
For a 100 s delay the URI could look like this:
from("file://" + env.getProperty("integration.directory.scan.add.eng.jobslist")+"?readLock=changed&readLockMinAge=100s")
Use a combination of the options readLock=changed, readLockCheckInterval=1000 and readLockMinAge=20s.
(1000 ms is the default check interval; it should be raised if writes are slower, i.e. the file size changes only after a long time. On certain filesystems the file size is not updated very frequently while a transfer is in progress.)
The file component documentation at http://camel.apache.org/file2.html says, for readLock=changed:
changed is using file length/modification timestamp to detect whether the file is currently being copied or not. Will at least use 1 sec. to determine this, so this option cannot consume files as fast as the others, but can be more reliable as the JDK IO API cannot always determine whether a file is currently being used by another process. The option readLockCheckInterval can be used to set the check frequency.
for readLockCheckInterval=1000
Camel 2.6: Interval in milliseconds for the read-lock, if supported by the read lock. This interval is used for sleeping between attempts to acquire the read lock. For example when using the changed read lock, you can set a higher interval period to cater for slow writes. The default of 1 sec. may be too fast if the producer is very slow writing the file.
for readLockMinAge=20s
Camel 2.15: This option applies only to readLock=changed. This option allows you to specify a minimum age a file must be before attempting to acquire the read lock. For example, use readLockMinAge=300s to require that the file is at least 5 minutes old. This can speed up the poll when the file is old enough, as it will acquire the read lock immediately.
So in the end your endpoint should look something like
from("file://" + env.getProperty("integration.directory.scan.add.eng.jobslist")+"?consumer.initialDelay=100000&readLock=changed&readLockCheckInterval=1000&readLockMinAge=20s")
OK, it turned out to be a combination of things. First off, I test inside of IntelliJ and also outside for several reasons; one is a security issue with using email within IDEA. Tomcat, outside of IntelliJ, was picking up a classes folder in the webapps/ROOT directory, which was overriding my changes to the URI options. That's what was driving me nuts. That ROOT folder had been there since a deployment error several months ago, but it wasn't being picked up by IntelliJ even though I was using the same Tomcat instance. That's why it appeared that my changes were being ignored.

Apache Camel - Copying a large file into a consumer folder

I have a route that expects that various files will be copied into an incoming folder. The route will proceed to move these files into a temp folder where it will do other stuff. The route is as follows:
<route id="incoming" >
<from uri="file://my/path/incoming"/>
<to uri="file://my/path/incoming/temp"/>
</route>
The issue is that these files may be quite large, let's say 1 GB. Copying such a file into the incoming folder may take, let's say, 10 seconds. During these 10 seconds the consumer polls the directory, and an exception is thrown since the partial file is still being copied. What workaround could I use?
I have tried all readLock strategies (primarily changed), but I get an exception:
(The process cannot access the file because it is being used by another process)
The modified uri is as follows:
<from uri="file://my/file/path?readLockCheckInterval=3000&readLock=changed"/>
Still no luck though
Check the readLock options in the File component
Used by consumer, to only poll the files if it has exclusive read-lock on the file (i.e. the file is not in-progress or being written). Camel will wait until the file lock is granted.
This option provides the built-in strategies:
markerFile Camel creates a marker file (fileName.camelLock) and then holds a lock on it.
changed is using file length/modification timestamp to detect whether the file is currently being copied or not. Will at least use 1 sec. to determine this, so this option cannot consume files as fast as the others, but can be more reliable as the JDK IO API cannot always determine whether a file is currently being used by another process. The option readLockCheckInterval can be used to set the check frequency.
fileLock is for using java.nio.channels.FileLock. This approach should be avoided when accessing a remote file system via a mount/share unless that file system supports distributed file locks.
rename tries to rename the file as a test of whether we can get an exclusive read lock.
The readLock=changed option seems appropriate in this case. There can be issues if you have a very slow producer writing files to the incoming folder.
The other option is to use a done file. You can make the original producer create a done file after the file write is completed.
it is more common to have one done file per target file. This means there is a 1:1 correlation. To do this you must use dynamic placeholders in the doneFileName option. Currently Camel supports the following two dynamic tokens: file:name and file:name.noext, which must be enclosed in ${ }. The consumer only supports the static part of the done file name as either prefix or suffix (not both).
from("file:bar?doneFileName=${file:name}.done");
In this example, files will only be polled if there exists a done file with the name fileName.done.
Something like this will work. In case a non-Camel system is copying your large file into the InputDir, you have to take care to create the .DONE file after the file is copied. Once the .DONE file is available, the route will start processing.
from("file://" + InputDir + "?delay=500&doneFileName=${file:name}.DONE")
.to("file://" + OutputDir + "?fileName=${date:now:yyyyMMdd}/${file:name}&doneFileName=${file:name}.DATA.READY.DONE");
Probably this is late in the game but use fileExist=Append in the route URI. Example:
<route id="incoming" >
<from uri="file://my/path/incoming"/>
<to uri="file://my/path/incoming/temp?fileExist=Append"/>
</route>

Losing files in camel when consuming with multiple threads

I'm using Apache Camel 2.11.1
Have such route:
from("file:///somewhere/").
threads(20).
to("direct:process")
Some time I'm getting this exception: org.apache.camel.InvalidPayloadException with message
No body available of type: java.io.InputStream but has value: GenericFile[/somewhere/file.txt] of type:
org.apache.camel.component.file.GenericFile on: file.txt. Caused by: Error during type conversion from type:
org.apache.camel.component.file.GenericFile to the required type: byte[] with value GenericFile[/somewhere/file.txt]
due java.io.FileNotFoundException: /somewhere/file.txt (No such file or directory).
Since I'm seeing lots of .camelLock files in the directory, I assume this happens due to several threads attempting to process the same file. How can I avoid that?
UPDATE 1
Tried using a scheduledExecutorService and removing threads(20). I seem to be losing fewer files, but I'm still losing them. How can I avoid this? Any help will be greatly appreciated.
I had a similar issue; mine was two file processors retrieving from the same directory. Result: losing all my files.
Here is the scenario:
Thread#1 retrieves file1 and moves it to the process folder
Thread#2 retrieves the same file, file1, at the same time; file1 is deleted
Thread#2 cannot find file1 in the source directory, so the rename fails
Thread#1 fails due to the file deleted by Thread#2
Here is the reason:
If you check the GenericFileProcessStrategySupport.renameFile method, you'll see that Camel first deletes the target file, then renames the source file to the target. That's why the above condition occurs.
I don't know of a generic solution; either the source-consumer relation should be separated or a work-distributor mechanism should be implemented.
Since your threads live in the same JVM, I suggest you implement a concurrent load distributor that hands each requester one file name at a time in a thread-safe way.
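Such a distributor can be sketched in plain Java (an illustration, not Camel API; the class and file names are made up): a thread-safe queue hands out each file name exactly once, so two consumer threads can never claim the same file:

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal in-JVM load distributor: each file name is handed out exactly once.
public class FileDistributor {
    private final Queue<String> files = new ConcurrentLinkedQueue<>();

    FileDistributor(List<String> fileNames) { files.addAll(fileNames); }

    // Returns the next unclaimed file name, or null when none remain.
    // ConcurrentLinkedQueue.poll() is atomic, so no file is handed out twice
    // even when many consumer threads call next() concurrently.
    String next() { return files.poll(); }

    public static void main(String[] args) {
        FileDistributor d = new FileDistributor(List.of("a.txt", "b.txt"));
        System.out.println(d.next()); // prints "a.txt"
        System.out.println(d.next()); // prints "b.txt"
        System.out.println(d.next()); // prints "null" (nothing left to claim)
    }
}
```

Each consumer thread would loop on next() and stop when it returns null, instead of all threads polling the directory independently.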
