I'm using a File Inbound Endpoint in Mule to process files from one directory and, after processing, move them to another directory. The problem I have is that sometimes there are a lot of files in the "incoming directory", and when Mule starts up it tries to process them concurrently. This is no good for the DB that is accessed and updated in the flow. Can the files be read sequentially, in no particular order?
Set the flow processing strategy to synchronous to ensure the file poller thread is used across the whole flow, so each file is processed end to end before the next one is picked up:
<flow name="filePoller" processingStrategy="synchronous">
On top of that, do not use any <async> block or one-way endpoint downstream in the flow; otherwise another thread pool will kick in, leading to potential (and, for your use case, undesired) parallel processing.
We are using StreamingFileSink in Flink 1.11 (AWS KDA) to write data from Kafka to S3.
Sometimes, even after a proper stopping of the application it will fail to start with:
com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
By looking at the code I can see that files are moved from in-progress to pending during a checkpoint: files are synced to S3 as multipart uploads (MPUs), or as _tmp_ objects when the upload part is too small.
However, pending files are committed during notifyCheckpointComplete, after the checkpoint is done.
The StreamingFileSink will fail with the error above when an MPU that it has in state does not exist in S3.
Would the following scenario be possible:
Checkpoint is taken and files are transitioned into pending state.
notifyCheckpointComplete is called and it starts to complete the MPUs.
Application is suddenly killed or even just stopped as part of a shutdown.
The checkpointed state still has information about the MPUs, but restoring from it will not find them, because they were completed outside of the checkpoint and that completion is not part of the state.
Would it be better to ignore missing MPUs and _tmp_ files? Or make it an option?
This way the above situation would not happen and it would allow to restore from arbitrary checkpoints/savepoints.
The Streaming File Sink in Flink has been superseded by the new File Sink implementation since Flink 1.12, which uses the Unified Sink API for both batch and streaming use cases. I don't think the problem you've described here would occur when using that implementation.
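For reference, here is a minimal sketch of a job wired to the newer FileSink; the bucket path is a placeholder and the in-memory source only stands in for the real Kafka source:

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class KafkaToS3Job {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // FileSink only finalizes (commits) files on successful checkpoints.
            env.enableCheckpointing(60_000);

            // Placeholder source; the real job would build a DataStream<String> from Kafka here.
            DataStream<String> records = env.fromElements("example-record");

            // Row-format FileSink writing to S3; bucket and path are hypothetical.
            FileSink<String> sink = FileSink
                    .forRowFormat(new Path("s3a://my-bucket/output"),
                                  new SimpleStringEncoder<String>("UTF-8"))
                    .build();

            records.sinkTo(sink);
            env.execute("kafka-to-s3");
        }
    }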
I have a job that creates files on the network folder from various non-database sources. Within the job, I isolate the various file creation tasks (contained in a sequence container) from the move file task (foreach enumerator) in order to prevent a spider's web of precedence constraints from the various file creation tasks:
Data flow task that contains a script component using C# and LDAP to pull data from Active Directory and output it to multiple files
Script Component that downloads files from SFTP (implements WinSCPNET.dll)
Upon successful completion, the sequence container then goes to a foreach file enumerator to move the extracted files to a folder that indicates files are ready for loading - there is no problem here.
However, an intermittent problem arose in production where the AD connection was terminating before the file extract process completed, resulting in partial files (this was not observed in testing, but should have been contemplated - my bad). So, I added a foreach enumerator outside of the sequence container with a failure precedence constraint to delete these partial extract files.
During testing of this fix, I set one of the tasks within the sequence container to report failure. Initially, the sequence container reported success, thus bypassing the delete foreach enumerator. I tried setting the MaximumErrorCount from 0 to 1, but that did not produce the desired behavior change. I then changed the sequence container's TransactionOption from Supported to Required, and this appears to have fixed the problem. Now the job moves files that are completely extracted while deleting files that report an error on extraction.
My question is this: Is there a potential problem going this route? I am unsure as to why this solution works. The documentation online discusses the TransactionOption in the context of a connection to the database. But, in this case there is no connection to the database. I just don't want to release a patch that may have a potential bug that I am not aware of.
Regarding Transactions and Files.
Presume you write your files to NTFS or another file system that supports transactions. Then all file create and file save actions are enclosed in one transaction. If the transaction fails due to a task failure, all files created inside the transaction are rolled back, i.e. deleted.
So you get an "all or nothing" approach on files, receiving files only if all extractions succeed.
If you store the files on a non-transactional file system, like old FAT, this "all or nothing" no longer works and you will receive a partial set of files; a transaction set on the Sequence Container has no such effect there.
I can see similar problems in different variations but haven't managed to find a definite answer.
Here is the use case:
SFTP server that I want to poll from every hour
on top of that, I want to expose a REST endpoint that the user can hit to force an ad-hoc retrieval from that same SFTP. I'm happy for the polling schedule to remain as-is, i.e. if I polled and the user forces a refresh 20 minutes later, the next poll can be 40 minutes after that.
Both these should be idempotent in that a file that was downloaded using the polling mechanism should not be downloaded again in ad-hoc pull and vice-versa. Both ways of accessing should download ALL the files available that were not yet downloaded (there will likely be more than one new file - I saw a similar question here for on-demand fetch but it was for a single file).
I would like to avoid hammering the SFTP via pollEnrich - my understanding is that each pollEnrich would request a fresh list of files from SFTP, so doing pollEnrich in a loop until all files are retrieved would be calling the SFTP multiple times.
I was thinking of creating a route that will start/stop a separate route for the ad-hoc fetch, but I'm not sure that this would allow for the idempotent behaviour between routes to be maintained.
So, smart Camel brains out there, what is the most elegant way of fulfilling such requirements?
Not a smart Camel brain, but I would give it a try as per my understanding.
Hope you have already gone through:
http://camel.apache.org/file2.html
http://camel.apache.org/ftp2.html
I would create a filter and separate routes for the consumer and the producer.
For consuming, I would use these file options: idempotent=true, idempotentKey=${file:onlyname}, idempotentRepository, readLock=idempotent, delay, initialDelay, useFixedDelay=true, maxMessagesPerPoll=1, eagerMaxMessagesPerPoll=true, recursive=false.
No files will be read again! You can use a variety of options as documented and try which suits you best, like the delay option.
"I would like to avoid hammering the SFTP via pollEnrich - my understanding is that each pollEnrich would request a fresh list of files from SFTP, so doing pollEnrich in a loop until all files are retrieved would be calling the SFTP multiple times." - > Unless you use the option disconnect=true, the connection will not be terminated and you can either consume or produce files continously, check ftp options for disconnect and disconnectOnBatchComplete.
Hope this helps!
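Not from the original answer, but roughly how those options could be combined on an SFTP consumer endpoint; the host, credentials, idempotent repository bean and target directory are all placeholders:

    import org.apache.camel.builder.RouteBuilder;

    public class SftpPollingRoute extends RouteBuilder {
        @Override
        public void configure() {
            // Poll the SFTP server hourly and remember consumed files in an idempotent repository.
            from("sftp://user@sftp.example.com/inbox?password=secret"
                    + "&delay=3600000&initialDelay=5000&useFixedDelay=true"
                    + "&maxMessagesPerPoll=1&eagerMaxMessagesPerPoll=true"
                    + "&recursive=false"
                    + "&idempotent=true&idempotentKey=${file:onlyname}"
                    + "&idempotentRepository=#fileRepo"   // e.g. a shared JdbcMessageIdRepository bean
                    + "&readLock=changed"                 // wait until the remote file stops changing
                    + "&disconnect=false")                // keep the connection open between polls
                .routeId("sftpPoller")
                .to("file://data/downloads");             // local drop folder (placeholder)
        }
    }

Because the scheduled poll and any ad-hoc consumer would point at the same idempotentRepository and key, a file downloaded by one would be skipped by the other.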
I am using Camel in a clustered environment and want to use a read lock on my file consumer endpoint so that only one server tries to process each file.
The only cluster-safe read lock is the idempotent read lock; however, this requires an idempotentRepository to be set on the file URI.
I use an idempotent consumer within the route which moves any duplicate files to an error folder and logs the error to a specific file. This uses a specified JDBCMessageIdRepository to store the idempotent keys.
Is there a way to use the duplicate handling logic from the idempotent consumer with the idempotent readlock? Or a way to set the idempotentRepository in the file component to not skip the duplicates so they are picked up by the idempotent consumer in the route instead?
If you're using the same idempotent repository keys for locking and consuming purposes, then you won't be able to try processing the same file twice: the idempotent consumer's check happens after the read lock's check, so your file will be skipped before the consumer gets to check whether it exists. Also, I think you may encounter issues between the consumer and the read lock itself: since the read lock inserts the row into the DB before the consumer performs its check, the consumer may report the file as already seen.
The fastest solution that I can think of is to use different keys for idempotent consumer and read lock - that way there won't be any conflict between them - and also make read lock remove the key on commit by setting readLockRemoveOnCommit to true. This way, the read lock will not allow concurrent processing of the file but will use idempotent consumer's key to check whether the file has been processed before or not.
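As an illustration only (bean names, paths and key expressions are made up), the idea of keeping the read lock's bookkeeping separate from the in-route idempotent consumer could look something like this, with skipDuplicate=false so duplicates can still be routed to the error folder:

    import org.apache.camel.Exchange;
    import org.apache.camel.builder.RouteBuilder;
    import org.apache.camel.spi.IdempotentRepository;

    public class FileConsumerRoute extends RouteBuilder {

        // The JDBC-backed repository already used for the in-route idempotent consumer.
        private final IdempotentRepository jdbcRepo;

        public FileConsumerRoute(IdempotentRepository jdbcRepo) {
            this.jdbcRepo = jdbcRepo;
        }

        @Override
        public void configure() {
            // #lockRepo is a separate, cluster-shared repository used only by the read lock;
            // readLockRemoveOnCommit releases the lock entry once processing commits.
            from("file://shared/inbox"
                    + "?readLock=idempotent"
                    + "&idempotentRepository=#lockRepo"
                    + "&readLockRemoveOnCommit=true")
                // The in-route idempotent consumer keeps its own keys in the JDBC repository,
                // so "has this file been processed before" is tracked independently of locking.
                .idempotentConsumer(simple("${file:name}"), jdbcRepo)
                .skipDuplicate(false)
                .filter(exchangeProperty(Exchange.DUPLICATE_MESSAGE).isEqualTo(true))
                    .log("Duplicate file ${file:name}, moving to error folder")
                    .to("file://shared/error")
                    .stop()
                .end()
                .to("file://shared/processed");
        }
    }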
I have a polling service that checks a directory for new files; if there is a new file, I call SSIS.
There are instances where I can't have SSIS run if another instance of SSIS is already processing another file.
How can I make SSIS run sequentially during these situations?
Note: parallel SSIS runs are fine in some circumstances but not in others; how can I achieve both?
Note: I don't want to go into WHEN/WHY it can't run in parallel at times - just assume that sometimes it can and sometimes it can't. The main question is: how can I prevent an SSIS call IF it has to run in sequence?
If you want to control the flow sequentially, think of a design where you enqueue requests (for invoking SSIS) into a queue data structure. Only the request at the head of the queue is processed at any given time; as soon as it completes, the next request can be dequeued.
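As a rough, language-agnostic sketch of that idea (shown here in Java, though the actual poller may well be .NET), the dispatcher could hold one single-threaded queue for the "must be sequential" cases and a normal pool for everything else; launchSsisPackage is a placeholder for however the package is really invoked (dtexec, catalog stored procedures, etc.):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class SsisDispatcher {

        // Single-threaded executor: queued requests run strictly one at a time, in FIFO order.
        private final ExecutorService sequentialQueue = Executors.newSingleThreadExecutor();

        // Separate pool for files that are allowed to be processed in parallel.
        private final ExecutorService parallelPool = Executors.newCachedThreadPool();

        /** Called by the polling service for each new file it detects. */
        public void dispatch(String filePath, boolean mustRunSequentially) {
            Runnable job = () -> launchSsisPackage(filePath);
            if (mustRunSequentially) {
                sequentialQueue.submit(job);   // waits its turn behind any running/queued package
            } else {
                parallelPool.submit(job);      // may run alongside other packages
            }
        }

        // Placeholder for the real package invocation.
        private void launchSsisPackage(String filePath) {
            System.out.println("Running SSIS package for " + filePath);
        }
    }

A request submitted to the single-threaded executor while a package is running simply waits for it to finish, which gives you the sequential behaviour without blocking the parallel-friendly files.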