Sorting issue with Apache Camel - file

We have an application where we use the file component of Apache Camel. We implemented our own comparator, which we reference using #sorter. The file component reads files from four different folders and sorts them.
We have maxMessagesPerPoll set to 0 and eagerMaxMessagesPerPoll set to false.
The issue described below happens when there are somewhere between 1k and 5k files in the four folders combined.
Camel apparently has two threads, thread #1 and thread #2; usually thread #1 runs the sorting code and thread #2 processes the files. But when there are between 1k and 5k files or more, even thread #1 starts processing, which causes files to go out of order. See the logs in Listing 1 for an example of both thread #1 and thread #2 processing files.
FYI, the initial sorting of all 5000 files was done by thread #1, but during processing thread #1 at times contributes to processing files too, which results in files going out of order. This does not happen when the number of files is low, e.g. 200; then only thread #2 processes the files.
How can I keep the processing confined to just thread #2? Is there a property that can be set?
Listing 1
20200829 13:45:00.516 - [Camel (xyz) **thread #1** - file:///export/data/abc/xyz/zyz] INFO a.b.c.Transformer - Processing started for file /export/data/abc/xyz/zyz//f/g/h../run/file1.xml
20200829 13:45:00.576 - [Camel (xyz) **thread #1** - file:///export/data/abc/xyz/zyz] INFO a.b.c.Transformer - Processing completed for file /export/data/abc/xyz/zyz//f/g/h../run/file1.xml in 0 seconds
20200829 15:15:14.910 - [Camel (xyz) **thread #2** - Threads] INFO a.b.c.Transformer - Processing started for file /export/data/abc/xyz/zyz/g/f/h../run/file2_XML
20200829 15:15:15.007 - [Camel (xyz) **thread #2** - Threads] INFO a.b.c.Transformer - Processing completed for file /export/data/abc/xyz/zyz/g/f/h../run/file2_XML in 0 seconds
I tried the following suggestion:
Use maxMessagesPerPoll=1 and set eagerMaxMessagesPerPoll=false
as found here: http://www.davsclaus.com/2008/12/camel-and-file-sorting.html
but that presents its own problems. Say there are 3000 files: it processes one file and then re-sorts the remaining files, which slows the whole process considerably, since sorting takes more than 45 minutes.

The secret to this was to use the synchronous option, as described in the File2 component documentation of Apache Camel. There will always be two threads; once you use synchronous, only thread #2 processes the files, not thread #1.
Additionally, leave maxMessagesPerPoll at 0 and set eagerMaxMessagesPerPoll to true.
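For illustration, a minimal sketch of such an endpoint, assuming a comparator bean registered as #sorter; the folder path and the target route are placeholders, not the original configuration:

from("file:/export/data/abc"
        + "?sorter=#sorter"               // custom comparator bean
        + "&maxMessagesPerPoll=0"         // no per-poll limit
        + "&eagerMaxMessagesPerPoll=true"
        + "&synchronous=true")            // keep processing on the polling thread
    .to("direct:process");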
I would have to say that the Camel documentation is poor and not without its grammatical mistakes.

Related

Spark alternatives for reading incomplete files

I am using Spark 2.3 to read Parquet files. The files average about 20 MB each, and there are currently about 20,000 of them. The file directories are partitioned down to the day level across 24 months.
The issue I am facing is that occasionally most of the files are rewritten by an HBase application. This does not occur every day, but when it does, that process takes several days to complete for all files. My process needs to look at the data in ALL files each day to scan for changes, based on an update date stored in each record. But I am finding that if I read the HDFS dir during the period of the mass file rewrite, my Spark program fails with errors like these:
java.io.EOFException: Cannot seek after EOF
OR
java.io.IOException: Could not read footer for file: FileStatus{path=hdfs://project/data/files/year=2020/month=02/day=10/file_20200210_11.parquet
OR
Caused by: java.lang.RuntimeException: hdfs://project/data/files/year=2020/month=02/day=10/file_20200210_11.parquet is not a Parquet file. expected magic number at tail
I assume the errors occur because it is trying to read a file that is not finalized (still being written to), but from Spark's perspective it looks like a corrupt file. So my process fails and cannot read the files for a couple of days until all files are stable.
The code is pretty basic:
val readFilePath = "/project/data/files/"
val df = spark.read
  .option("mergeSchema", "true") // reader option; "spark.sql.parquet.mergeSchema" is the session-conf spelling
  .parquet(readFilePath)
  .select(
    $"sourcetimestamp",
    $"url",
    $"recordtype",
    $"updatetimestamp"
  )
  .withColumn("end_date", to_date(lit("2021-08-23"), "yyyy-MM-dd"))
  .withColumn("start_date", date_sub($"end_date", 2))
  .withColumn("update_date", to_date($"updatetimestamp", "yyyy-MM-dd"))
  .filter($"update_date" >= $"start_date" and $"update_date" <= $"end_date")
Is there anything I can do programmatically to work around a problem like this? It doesn't seem like I can trap an error like that and make the read continue. In Spark version 3 there is an option, spark.sql.files.ignoreCorruptFiles, that I think would help, but not in version 2.3, so that doesn't help me.
I considered reading a single file at a time and looping through all the files, but that would take forever (about 12 hours, based on a test of just a single month). Apart from asking the application owner to write changed files to a temp dir and then move and replace each one into the main dir as it completes, I don't see any other alternatives so far. I am not even sure that would help, or whether I would run into collisions during the small window of time while a file is being moved.
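One possible middle ground between a monolithic read and a per-file loop, sketched below purely as an idea (the per-partition granularity and all class, method, and path names are assumptions, not anything from the original post), is to probe each day-partition directory separately and union only the ones that read cleanly:

// Hypothetical workaround sketch: skip partitions whose files are mid-rewrite.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StablePartitionReader {
    public static Dataset<Row> readStable(SparkSession spark, List<String> partitionDirs) {
        List<Dataset<Row>> readable = new ArrayList<>();
        for (String dir : partitionDirs) {
            try {
                // Parquet footers are consulted while resolving the schema, so a
                // half-written file usually fails here; errors can still surface
                // later at action time, so downstream actions need the same guard.
                readable.add(spark.read().parquet(dir));
            } catch (Exception e) {
                // A file in this partition is presumably still being rewritten; skip it this run.
                System.err.println("Skipping unstable partition " + dir + ": " + e.getMessage());
            }
        }
        if (readable.isEmpty()) {
            throw new IllegalStateException("No readable partitions found");
        }
        Dataset<Row> union = readable.get(0);
        for (int i = 1; i < readable.size(); i++) {
            union = union.unionByName(readable.get(i));
        }
        return union;
    }
}

Note that reading leaf directories directly drops the year/month/day partition columns unless the reader is also given .option("basePath", readFilePath), and whether a per-partition probe is fast enough would have to be measured.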

Using camel-smb, SMB picks up (large) files while they are still being written to

While trying to create a cyclic movement of files, I encountered strange behavior with readLock. Create a large file (100 MB or so) and transfer it using SMB from the out to the in folder.
FROM:
smb2://smbuser:****@localhost:4455/user/out?antInclude=FILENAME*&consumer.bridgeErrorHandler=true&delay=10000&inProgressRepository=%23inProgressRepository&readLock=changed&readLockMinLength=1&readLockCheckInterval=1000&readLockTimeout=5000&streamDownload=true&username=smbuser&delete=true
TO:
smb2://smbuser:****@localhost:4455/user/in?username=smbuser
Create another flow to move the file back from the in to the out folder. After some transfers, the file will be picked up while it is still being written to by the other route, and the transfer will be done with a much smaller file, resulting in a partial file at the destination.
FROM:
smb2://smbuser:****@localhost:4455/user/in?antInclude=FILENAME*&delete=true&readLock=changed&readLockMinLength=1&readLockCheckInterval=1000&readLockTimeout=5000&streamDownload=false&delay=10000
TO:
smb2://smbuser:****@localhost:4455/user/out
The question is: why is my readLock not working properly? (N.B. streamDownload is required.)
UPDATE: it turns out this only happens on a Windows Samba share, and with streamDownload=true. So, something to do with stream chunking. Any advice welcome.
The solution requires preventing the polling strategy from automatically picking up a file, by making it aware of the read lock (in progress) on the other side. So I lowered delay to 5 seconds and, in the FROM part on both sides, added readLockMinAge=5s, which inspects the file modification time.
Since the stream writes every second, this is enough time to prevent a premature pickup; the adjusted consumer could look like the sketch below.
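For illustration only, here is the OUT-side consumer with the adjustments described above applied (delay lowered to 5 seconds, readLockMinAge=5s); the host, port, share and file pattern are copied from the question, and the password is a dummy value, not working configuration:

from("smb2://smbuser:secret@localhost:4455/user/out"
        + "?antInclude=FILENAME*"
        + "&delete=true"
        + "&delay=5000"                  // poll every 5 seconds
        + "&readLock=changed"
        + "&readLockMinLength=1"
        + "&readLockCheckInterval=1000"
        + "&readLockTimeout=5000"
        + "&readLockMinAge=5s"           // ignore files modified within the last 5 s
        + "&streamDownload=true")
    .to("smb2://smbuser:secret@localhost:4455/user/in?username=smbuser");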
An explanation of why the previously mentioned situation happens:
When a route prepares to pick up from the out folder, a large file (1 GB) is in progress, chunk by chunk, into the in folder. At the end of the streaming, the file is marked for removal by camel-smbj and receives the status STATUS_DELETE_PENDING.
Now another part of this process starts to send a newly arrived file to the out folder and finds that the file already exists there. Because of the default fileExists=Override strategy, it tries to delete (and afterwards store) the existing file, which is still not deleted from the previous step, and receives an exception, which causes some InputStream chunks to be lost.

sFTP with multiple consumers and idempotent routes: file exceptions

I have a couple of Camel routes that process files located on an sFTP server. There are multiple nodes running the application where the routes are located, so I have added file locking to ensure that only a single node processes each file.
The route URI looks like this (split over multiple lines for readability):
sftp://<user>@<server>:<port>/<folder>
?password=<password>
&readLock=changed
&readLockMinAge=10s
&delete=true
&delay=60s
&include=<file mask>
The route looks like this:
from(inUri)
    .id("myId")
    .idempotentConsumer(header(Exchange.FILE_NAME), messageIdRepository)
    .eager(true)
    .skipDuplicate(true)
    .log(LoggingLevel.INFO, "Processing file: ${file:name}")
    // Save original file in archive directory.
    .to(archiveUri)
    // ... do other stuff ...
Every now and then, I'm getting what looks like file contention warning messages:
Error processing file RemoteFile[<file>] due to Cannot retrieve file: <file>. Caused by:
[org.apache.camel.component.file.GenericFileOperationFailedException - Cannot retrieve file: <file>]
org.apache.camel.component.file.GenericFileOperationFailedException: Cannot retrieve file: <file>
Caused by: com.jcraft.jsch.SftpException: No such file
... and also these:
Error during commit. Caused by: [org.apache.camel.component.file.GenericFileOperationFailedException - Cannot delete file: <file>]
org.apache.camel.component.file.GenericFileOperationFailedException: Cannot delete file: <file>
Caused by: com.jcraft.jsch.SftpException: No such file
Have I missed anything in my setup?
I've tried adding the idempotent repository to the input URI, like this:
sftp://<user>@<server>:<port>/<folder>
?password=<password>
&readLock=changed
&readLockMinAge=<minAge>
&delete=true
&delay=<delay>
&include=<file mask>
&idempotent=true
&idempotentKey=$simple{file:name}
&idempotentRepository=#messageIdRepository
... but am getting the same kind of errors.
I'm using Apache Camel version 2.24.2.
File locking across multiple nodes is (in my experience) not very reliable. When the clients are distributed, a distributed file lock must be supported, and I don't know what the FTP layer adds to this mix, but I don't think it makes things easier.
Normally, the errors you get are "just" from a node trying to process an already processed (and therefore deleted) file. So they are not a real problem, and if you can live with them, your setup probably works correctly (files are not actually processed multiple times).
What I usually do is:
Use the initialDelay and delay options to avoid all nodes polling at the same time
Use the option shuffle=true so that every client shuffles the file list randomly; that lowers the chance that all clients try to process the same file
These measures are likewise not fully reliable; at best they are an improvement.
The only way to reliably avoid processing the same file on multiple nodes is probably to use just one consumer. We use this setup in combination with health checks and automatic restarts to make sure the single consumer is available.
Note that adding an idempotent repository is worth nothing unless you use a distributed repository such as Hazelcast. If you only use a local (or in-memory) repository, it is individual to each node and therefore only prevents the same client from processing the same file multiple times.
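For illustration only, a sketch of wiring such a shared repository, assuming camel-hazelcast is on the classpath; the map name "processed-files" is a made-up placeholder:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import org.apache.camel.processor.idempotent.hazelcast.HazelcastIdempotentRepository;

// Sketch: a Hazelcast-backed idempotent repository that all nodes share,
// replacing a per-node in-memory repository. The map name is a placeholder.
HazelcastInstance hz = Hazelcast.newHazelcastInstance();
HazelcastIdempotentRepository messageIdRepository =
        new HazelcastIdempotentRepository(hz, "processed-files");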
Update due to comment
If you have a distributed idempotent repository and files are still processed multiple times, the repository does not work. I can only assume some possible causes:
The repository does nothing (check it for entries)
You have to set readLock to idempotent to "enable" the repository for the file component
For whatever reason, the same file gets a different idempotency ID on different nodes (check in your repo), so the files are not recognized as duplicates
Unfortunately, the idempotent setting of readLock is only supported by the file component, not by the FTP component.
There is also an inProgressRepository option; you could also try to attach your repository via this option, as sketched below.
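As a sketch of that last suggestion, with dummy host, credentials and file mask standing in for the question's placeholders, and the same #messageIdRepository bean attached as the in-progress repository:

from("sftp://user@server:22/folder"
        + "?password=secret"
        + "&readLock=changed"
        + "&readLockMinAge=10s"
        + "&delete=true"
        + "&delay=60s"
        + "&include=.*"
        + "&inProgressRepository=#messageIdRepository") // shared repository bean
    .to("direct:process");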

How do I introduce a Delay in Camel to prevent file locking before the files are copied in?

I am using Camel, ActiveMQ, and JMS to poll a directory and process any files it finds. The problem with larger files is that they start processing before being fully copied into the directory. I had assumed (yes, I know what assuming gets you) that the file system would prevent that, but that doesn't seem to be true. The examples in the Camel docs do not seem to be working. Here is my code from within the configure method of the RouteBuilder:
from("file://" + env.getProperty("integration.directory.scan.add.eng.jobslist")+"?consumer.initialDelay=100000")
.doTry()
.setProperty("servicePath").constant("/job")
.setProperty("serviceMethod").constant("POST")
.process("engImportJobsFromFileProcessor")
.doCatch(Exception.class)
.to("log:-- Add Job(s) Error -------------------------")
.choice()
.when(constant(env.getProperty("eng.mail.enabled.flag.add.jobslist.yn")).isEqualToIgnoreCase("Y"))
.setHeader("subject", constant(env.getProperty("integration.mq.topic.add.eng.jobslist.error.email.subject")))
.to("direct://email.eng")
.otherwise()
.to("log:-----------------------------------------")
.to("log:-- Email for JOBSLIST IS DISABLED")
.to("log:-----------------------------------------")
.end()
.end()
.log("Finished loading jobs from file ")
;
As you can see, I tried to set an initialDelay; I have also tried delay and readLock=changed, and nothing made a difference. As soon as the file hits the directory, Camel starts processing. All I am after is a nice simple delay before the file is polled. Any ideas?
Use the option readLockMinAge.
From the File2 component documentation:
This option allows you to specify a minimum age a file must be before attempting to acquire the read lock. For example, use readLockMinAge=300s to require that the file is at least 5 minutes old.
For a 100s minimum age, the URI could look like this:
from("file://" + env.getProperty("integration.directory.scan.add.eng.jobslist")+"?readLock=changed&readLockMinAge=100s")
Use a combination of the options readLock=changed, readLockCheckInterval=1000 and readLockMinAge=20s.
(1000 is in milliseconds and is the default value; it should be raised if writes are slow, i.e. if the file size changes only after a long time. On certain filesystems, the file size does not change very frequently while a transfer is in progress.)
The file component documentation at http://camel.apache.org/file2.html says
for readLock=changed
changed is using file length/modification timestamp to detect whether the file is currently being copied or not. Will at least use 1 sec. to determine this, so this option cannot consume files as fast as the others, but can be more reliable as the JDK IO API cannot always determine whether a file is currently being used by another process. The option readLockCheckInterval can be used to set the check frequency.
for readLockCheckInterval=1000
Camel 2.6: Interval in milliseconds for the read-lock, if supported by the read lock. This interval is used for sleeping between attempts to acquire the read lock. For example when using the changed read lock, you can set a higher interval period to cater for slow writes. The default of 1 sec. may be too fast if the producer is very slow writing the file.
for readLockMinAge=20s
Camel 2.15: This option applies only to readLock=changed. It allows you to specify a minimum age a file must be before attempting to acquire the read lock. For example, use readLockMinAge=300s to require that the file is at least 5 minutes old. This can speed up the poll when the file is old enough, as the read lock will be acquired immediately.
So in the end your endpoint should look something like
from("file://" + env.getProperty("integration.directory.scan.add.eng.jobslist")+"?consumer.initialDelay=100000&readLock=changed&readLockCheckInterval=1000&readLockMinAge=20s")
OK, it turned out to be a combination of things. First off, I test inside IntelliJ and also outside it, for several reasons; one is a security issue with using email within IDEA. Tomcat, outside of IntelliJ, was picking up a classes folder in the webapps/ROOT directory, which was overwriting my changes to the URI options. That's what was driving me nuts. That ROOT folder had been there since a deployment error several months ago, but it wasn't being picked up by IntelliJ even though I was using the same Tomcat instance. That's why it appeared that my changes were being ignored.

Losing files in camel when consuming with multiple threads

I'm using Apache Camel 2.11.1
I have this route:
from("file:///somewhere/").
threads(20).
to("direct:process")
Sometimes I get this exception: org.apache.camel.InvalidPayloadException, with the message
No body available of type: java.io.InputStream but has value: GenericFile[/somewhere/file.txt] of type:
org.apache.camel.component.file.GenericFile on: file.txt. Caused by: Error during type conversion from type:
org.apache.camel.component.file.GenericFile to the required type: byte[] with value GenericFile[/somewhere/file.txt]
due java.io.FileNotFoundException: /somewhere/file.txt (No such file or directory).
Since I'm seeing a lot of .camelLock files in the directory, I assume this happens because several threads attempt to process the same file. How do I avoid that?
UPDATE 1
I tried using a scheduledExecutorService and removing threads(20). It seems I'm losing fewer files, but I'm still losing some. How can I avoid this? Any help will be greatly appreciated.
I had a similar issue; mine was two file processors retrieving from the same directory. The result: losing all my files.
Here is the scenario:
Thread #1 retrieves file1 and moves it to the process folder
Thread #2 retrieves the same file, file1, at the same time; file1 is deleted
Thread #2 cannot find file1 in the source directory, so the rename fails
Thread #1 fails due to the file deleted by Thread #2
Here is the reason:
If you check the GenericFileProcessStrategySupport.renameFile method, you'll see that Camel first deletes the target file, then renames the source file to the target. That's why the above condition occurs.
I don't know of a generic solution; either the source-consumer relation should be separated, or a work-distributor mechanism should be implemented.
Since your threads live in the same JVM, I suggest you implement a concurrent load distributor that gives each requester one file name at a time in a concurrent way.
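One way to approximate such a distributor within a single JVM, sketched here as an assumption rather than a drop-in fix, is to keep a single file consumer and move the concurrency behind a SEDA queue, so only one thread ever touches the file endpoint:

// Sketch: one file consumer; 20 concurrent SEDA consumers do the actual work.
from("file:///somewhere/")
    .convertBodyTo(byte[].class) // materialize the content before the consumer completes the file
    .to("seda:process?waitForTaskToComplete=Never");

from("seda:process?concurrentConsumers=20")
    .to("direct:process");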
