I have two Camel applications whose duty is to read files from the same directory, process them, and send them to a DB consumer. To do this, my endpoints look like this:
file:/data/air?preMove=thread&readLock=fileLock&idempotent=true&idempotentRepository=#fileStore&include=AIROUTPUTCDR_(.*).AIR.gz&move=/data/air/success&moveFailed=error
As you can see, the application polls files from the poll directory based on the filters, moves them into the thread directory to be read, reads each file, and moves it to the success folder.
But with this flow, if I kill an application and start it again, the files which were being processed will not be processed, because they are left under the thread folder.
My question is: is there a way to resume reading the files whose processing was interrupted?
Thanks
No. If you do a hard kill on the application while a file was pre-moved, then you would need to manually move those files from the pre-move folder back into the source folder so they can be picked up again.
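One way to automate that manual step is a small recovery hook that runs before the CamelContext starts and sweeps any leftovers back. This is only a sketch against the endpoint above, so /data/air/thread is assumed to be the pre-move folder and /data/air the source folder:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class PreMoveRecovery {

        // Hypothetical helper: run before starting the CamelContext so that
        // files interrupted mid-processing are polled again.
        public static void recover() throws IOException {
            Path threadDir = Paths.get("/data/air/thread"); // preMove=thread target
            Path sourceDir = Paths.get("/data/air");        // polled source directory

            if (!Files.isDirectory(threadDir)) {
                return; // nothing was pre-moved, nothing to recover
            }
            try (DirectoryStream<Path> leftovers = Files.newDirectoryStream(threadDir)) {
                for (Path file : leftovers) {
                    // Move the interrupted file back so the consumer picks it up again.
                    Files.move(file, sourceDir.resolve(file.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

Note that with idempotent=true, whether a recovered file is actually picked up again can also depend on when your idempotentRepository records file names, so verify that behavior for your repository.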
Here is my question:
Will Flink finish the sink process and rename the .inprogress files to part-x-x files when a stop command is sent?
I find that my Flink tasks (using flink-1.9.1) do not rename the .inprogress files to part-x-x files. But when I read the source code, it says in
org.apache.flink.client.program.ClusterClient#stopWithSavepoint:
* Stops a program on Flink cluster whose job-manager is configured in this client's configuration.
* Stopping works only for streaming programs. Be aware, that the program might continue to run for
* a while after sending the stop command, because after sources stopped to emit data all operators
* need to finish processing.
The StreamingFileSink does have some limitations in this regard. See this thread from the user@flink.apache.org mailing list.
FLIP-46, which is being tracked as FLINK-13103, is needed in order to fix this. Until then, the StreamingFileSink will remain unable to transition unfinished files to the finished state when a job is stopped. This is described in the documentation as Important Note 2.
In our ESB project, we have a lot of routes reading files with the file2 or ftp protocol for further processing. It is important to note that the files we read locally (file2 protocol) are on network shares mounted via different protocols (NFS, SMB).
Now we are facing issues with race conditions: both servers read the file and process it. We have reduced the likelihood of this by using the preMove option, but from time to time the duplicate read still occurs when both servers poll at the same millisecond. According to the documentation, an idempotentRepository together with readLock=idempotent could help, for example with Hazelcast.
However, I'm wondering whether this is a suitable solution for my issue, as I don't really know if it will work in all cases. Both servers read the file within milliseconds of each other, so the information that one server has already processed the file needs to be available in the Hazelcast grid at the point in time when the second server tries to read. Is that possible? What happens if there are minimal latencies (e.g. network related)?
In addition, the setting readLock=idempotent is only available for file2 but not for ftp. How can that issue be solved there?
Again: the issue is not preventing duplicate files in general; it is solely about preventing the race condition.
AFAIK the idempotent repository should prevent both consumers from reading the same file in your case.
The latency between detecting the file and the entry appearing in Hazelcast is not relevant, because the file consumers do not record what they have read. Instead, they both ask the repository for an exclusive read lock. The first one wins; the second one is denied, so it moves on to the next file.
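As a sketch of that wiring, both servers would point readLock=idempotent at the same Hazelcast-backed repository. The directory, the repository name, and the registry binding are assumptions here, and class and package names vary between Camel versions:

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import org.apache.camel.builder.RouteBuilder;
    import org.apache.camel.processor.idempotent.hazelcast.HazelcastIdempotentRepository;

    public class IdempotentFileRoute extends RouteBuilder {

        @Override
        public void configure() {
            // Both servers join the same Hazelcast grid, so the repository
            // ("camel-file-repo" is a placeholder name) is cluster-wide.
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            HazelcastIdempotentRepository repo =
                    new HazelcastIdempotentRepository(hz, "camel-file-repo");
            getContext().getRegistry().bind("fileRepo", repo); // Camel 3 style binding

            // readLock=idempotent asks the shared repository for an exclusive
            // claim on each file name, so only one server wins each file.
            from("file:/data/in?readLock=idempotent&idempotent=true&idempotentRepository=#fileRepo")
                    .to("direct:process");
        }
    }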
If you want to minimize the potential for conflicts between the consumers, you can turn on shuffle=true to randomize the order in which files are consumed.
For the problem with the missing readLock=idempotent on the ftp consumer: you could perhaps build a separate transfer route with only one consumer that downloads the files. Your file-consumer route can then process them idempotently.
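A rough sketch of that split, where host, credentials, and paths are placeholders and #fileRepo refers to a shared repository such as the Hazelcast one above:

    import org.apache.camel.builder.RouteBuilder;

    public class FtpTransferRoutes extends RouteBuilder {

        @Override
        public void configure() {
            // Single downloader: run this route on one server only, so there
            // is no FTP-side race to worry about.
            from("ftp://user@ftp.example.com/inbox?password=secret&delete=true")
                    .to("file:/data/shared/in");

            // Both servers may run this consumer; readLock=idempotent with a
            // shared repository arbitrates access to each file.
            from("file:/data/shared/in?readLock=idempotent&idempotent=true&idempotentRepository=#fileRepo")
                    .to("direct:process");
        }
    }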
What happens in the Camel File Component consumer if file (1) is in process (so, per the documentation, it is locked: "By default the file is locked for the duration of the processing") and another file (2) with the same name is saved (by an ftp or non-ftp action) into file (1)'s directory? Will file (2) overwrite the in-process file (1)?
It depends on the file system you are using and what kind of locking you are using with Apache Camel. But yeah, it's always a bad thing to overwrite a file, unless you are sure it's okay.
And for FTP there is less guarantee of locking, as an FTP client cannot lock a file on a remote FTP server.
Study the Java file lock API (java.nio.channels.FileLock), and also how file locks work on the file system you use.
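As a minimal illustration (the path is a placeholder), FileChannel.tryLock returns null when another process already holds a lock, though how much that actually protects you depends on the underlying file system:

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class LockProbe {

        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(
                    Paths.get("/data/in/file1.dat"), StandardOpenOption.WRITE);
                 FileLock lock = channel.tryLock()) {
                if (lock == null) {
                    // Another process holds the lock, e.g. a Camel consumer
                    // configured with readLock=fileLock.
                    System.out.println("file is locked by another process");
                } else {
                    System.out.println("acquired the lock; safe to write");
                }
            }
        }
    }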
Let's assume that I have a file consumer that polls a directory every 10 seconds and does some sort of processing on the files it has found there.
This processing may take 40 seconds for each file. Does this mean that during that interval the consumer will poll the directory again and start another, similar process?
Is there any way I can avoid that, and not allow the consumer to poll while the previous poll has not finished?
The file consumer is single threaded, so it will not poll while it is already processing files.
When the consumer finishes, it will delay for 10s before polling again. This is controlled by the useFixedDelay option, which you can read more about under the JDK ScheduledExecutorService that Camel uses as its scheduler.
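In endpoint terms, a small sketch (the directory is a placeholder): with useFixedDelay=true the 10 second delay is measured from the end of one poll to the start of the next, so a 40 second batch simply pushes the next poll back:

    import org.apache.camel.builder.RouteBuilder;

    public class PollingRoute extends RouteBuilder {

        @Override
        public void configure() {
            // delay=10000 -> wait 10s between polls; useFixedDelay=true means
            // the delay is counted from when the previous poll (including its
            // processing) finished, not from when it started.
            from("file:/data/in?delay=10000&useFixedDelay=true")
                    .to("direct:process");
        }
    }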
I am trying to develop a file receipt process using Camel. What I am trying to do seems simple enough:
receive a file
invoke a web service which will look at that file and generate some metadata
move the file to a new location based on that metadata
invoke subsequent process(es) which will act on the file in its new location
I have tried several different approaches, but none work exactly as I would like. My main issue is that, since the file is not moved/renamed until the route has completed, I cannot signal to any downstream process from within that route that the file is available.
I also need to invoke web services in order to determine the new name and location; once I do that, the message body is changed and I cannot use a file producer to move the original file from within the route.
I would really appreciate hearing any other solutions.
You can signal the processing routes and then have them poll using the doneFile functionality of the file component.
Your first process will copy the files and signal the processing routes, and when it has finished copying a file it will write a done file. Once the done file has been written, the file consumers in your processing routes will pick up the file you want to process. This guarantees that the file is fully written before it is processed.
Check out the "Using done files" section of the file component.
http://camel.apache.org/file2.html
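A sketch of that pattern, where the paths and the done-file naming are assumptions: the consumer ignores a file until its matching done file exists:

    import org.apache.camel.builder.RouteBuilder;

    public class DoneFileRoutes extends RouteBuilder {

        @Override
        public void configure() {
            // Producer side: Camel writes the payload first, then creates the
            // matching done file to signal that the copy is complete.
            from("direct:receive")
                    .to("file:/data/staged?doneFileName=${file:name}.done");

            // Consumer side: a file is only picked up once its done file
            // exists; the done file is deleted automatically after pickup.
            from("file:/data/staged?doneFileName=${file:name}.done")
                    .to("direct:downstream");
        }
    }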
With other components you could have used the OnCompletion DSL syntax to trigger a post-route message for further processing.
However, with the file component this is not really doable, since the move/done-file handling happens in parallel with that OnCompletion trigger, and you can't be sure that the file is really done.
You might have some luck with the Unit of Work API, which can register post-route execution logic (this is how the file component fires off the move when the route is done).
However, do you really need this logic?
I see that you might want to send a wakeup call to some file consumer, but does the file really have to be ready that very millisecond? Can't you just start polling for the file once you receive the trigger message, and grab it when it's ready? That's usually how you do things with file-based protocols (or you just ignore the trigger and poll every now and then).
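For instance, here is a sketch of the trigger-then-poll idea (the endpoint names and the timeout are placeholders), using pollEnrich to wait for the file once the wakeup message arrives:

    import org.apache.camel.builder.RouteBuilder;

    public class TriggerThenPollRoute extends RouteBuilder {

        @Override
        public void configure() {
            // The trigger message only wakes the route up; pollEnrich then
            // blocks (up to 30s here) until the expected file shows up.
            from("direct:wakeup")
                    .pollEnrich("file:/data/ready?fileName=report.csv", 30000)
                    .to("direct:process");
        }
    }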