Consuming 1,000,000 files with doneFileName (1:1): multi-threading issue - apache-camel

As part of a performance scenario, my route consumes 1,000,000 files from an Amazon Linux EBS volume. maxMessagesPerPoll is set to 100 and route threads to 10. A done file is generated for each document, and the route can successfully consume all documents.
The issue is that at some point a done file is locked and cannot be deleted by thread A, which causes thread B, on the next poll, to attempt to process the correlated data file again. This second attempt fails, since thread A eventually deletes the file after successfully processing it.
How can I configure the route to avoid this? Should I move away from done files to a read-lock strategy?
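For reference, the route is configured roughly like the sketch below; the directory and the processor bean are placeholders, but the options mirror the setup described above (1:1 done files, 100 files per poll, 10 processing threads, delete after successful processing):

```java
import org.apache.camel.builder.RouteBuilder;

public class DocumentRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Directory and processor bean are placeholders.
        from("file:/mnt/ebs/inbox"
                + "?doneFileName=${file:name}.done"  // one done file per document
                + "&maxMessagesPerPoll=100"
                + "&delete=true")                    // delete after successful processing
            .threads(10)                             // 10 processing threads
            .to("bean:documentProcessor");
    }
}
```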

Related

How to create large number of entities in Cloud Datastore

My requirement is to create a large number of entities in Google Cloud Datastore. I have CSV files and, combined, the number of entities can be around 50k. I tried the following:
1. Read a CSV file line by line and create each entity in the Datastore.
Issues: It works, but it times out and cannot create all the entities in one go.
2. Upload all files to the Blobstore and read them into the Datastore.
Issues: I tried a Mapper function to read the CSV files uploaded to the Blobstore and create entities in the Datastore. The problem is that the mapper does not work if the file size grows larger than 2 MB. I also simply tried to read the files in a servlet, but again ran into the timeout issue.
I am looking for a way to create this large number (50k+) of entities in the Datastore all in one go.
Number of entities isn't the issue here (50K is relatively trivial). Finishing your request within the deadline is the issue.
It is unclear from your question where you are processing your CSVs, so I am guessing it is part of a user request - which means you have a 60 second deadline for task completion.
Task Queues
I would suggest you look into using Task Queues, where when you upload a CSV that needs processing, you push it into a queue for background processing.
When working with Task Queues, the tasks themselves still have a deadline, but one that is larger than 60 seconds (10 minutes when automatically scaled). You should read more about deadlines in the docs to make sure you understand how to handle them, including catching the DeadlineExceededError so that you can save where you are up to in a CSV and resume from that position when the task is retried.
Caveat on catching DeadlineExceededError
Warning: The DeadlineExceededError can potentially be raised from anywhere in your program, including finally blocks, so it could leave your program in an invalid state. This can cause deadlocks or unexpected errors in threaded code (including the built-in threading library), because locks may not be released. Note that (unlike in Java) the runtime may not terminate the process, so this could cause problems for future requests to the same instance. To be safe, you should not rely on the DeadlineExceededError, and instead ensure that your requests complete well before the time limit.
If you are concerned about the above, and cannot ensure your task completes within the 10 min deadline, you have 2 options:
Switch to a manually scaled instance, which gives you a 24-hour deadline.
Ensure your task saves progress and returns an error well before the 10-minute deadline so that it can be resumed correctly without having to catch the error.
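As an illustration of the enqueue step and of carrying a resume point, here is a minimal Java sketch using the App Engine Task Queue API; the worker URL, the parameter names, and the idea of passing the Blobstore key are assumptions for the example, not something from the original question.

```java
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

// From the upload request handler: hand the CSV off to a push queue instead
// of processing it inline, so the 60-second request deadline no longer applies.
public class CsvImportEnqueuer {
    public static void enqueueCsv(String blobKey) {
        Queue queue = QueueFactory.getDefaultQueue();
        queue.add(TaskOptions.Builder
                .withUrl("/tasks/import-csv")   // hypothetical worker servlet
                .param("blobKey", blobKey)      // which uploaded CSV to process
                .param("startLine", "0"));      // resume point; starts at the first line
    }
}
```

The worker behind /tasks/import-csv would then create entities from startLine onward and, well before the 10-minute task deadline, either re-enqueue itself with the next offset or record its position so a retry can resume from there.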

Multiple Biztalk host instances writing to single file

We have four BizTalk servers in the production environment. The send port is configured to write incoming messages to a single text file. This port receives thousands of messages a day, so multiple host instances try to write to the file at the same time; before one instance finishes writing a complete record, another instance starts writing a new record, leaving data scattered all over the file.
What can we do to resolve this issue?
...before one instance finishes writing a complete record, another instance starts writing a new record, leaving data scattered all over the file.
What can we do to resolve this issue?
The easy way is to only use a single Host Instance to write data to the file, however you may then start to experience throttling issues. Alternatively, you could explore using the 'Allow Cache on write' option on the File Adapter which may offer some improvements.
However, I think your approach is wrong. You cannot expect four separate and totally disconnected processes (across 4 servers, no less) to reliably append to a single file - IN ORDER.
I therefore think you should look at re-architecting this solution:
As each message is received, write the contents of the message to a database table (a simple INSERT) with an 'unprocessed' flag. You can reliably have four Host Instances banging data into SQL without fear of them tripping over each other.
At a scheduled time, have BizTalk extract all of the records that are marked as unprocessed in that SQL Table (the WCF-SQL Adapter can help you here). Once you have polled the records, mark them as 'in-process'.
You should now have a single message containing all of the currently unprocessed records (as retrieved from SQL). Using a single (or multiple) Host Instance/s, write the message to disk, appending each of the records to the file in a single write. The key here is that you are only writing a single message to the one file, not lots and lots and lots :-)
If the write is successful, update each of the records in the SQL table with a 'processed' flag so they are not picked up again on the next poll.
You might want to consider a singleton orchestration for this piece to ensure that there is only ever one poll-write-update process taking place at the same time.
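To make the staging-table pattern concrete, here is a rough sketch in plain JDBC; in BizTalk you would of course do the polling through the WCF-SQL adapter and an orchestration rather than hand-written code, and the table and column names here are made up for illustration.

```java
import java.io.Writer;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical MessageStage table with Id, Body and Status columns.
public class MessageStaging {

    // Step 1: every host instance only INSERTs, so there is no file contention.
    public static void stageMessage(Connection con, String body) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO MessageStage (Body, Status) VALUES (?, 'UNPROCESSED')")) {
            ps.setString(1, body);
            ps.executeUpdate();
        }
    }

    // Step 2: a single scheduled writer claims the batch, writes it once, then marks it done.
    public static void drainAndWrite(Connection con, Writer out) throws Exception {
        try (PreparedStatement claim = con.prepareStatement(
                "UPDATE MessageStage SET Status = 'IN_PROCESS' WHERE Status = 'UNPROCESSED'")) {
            claim.executeUpdate();
        }
        StringBuilder batch = new StringBuilder();
        try (PreparedStatement sel = con.prepareStatement(
                "SELECT Body FROM MessageStage WHERE Status = 'IN_PROCESS' ORDER BY Id");
             ResultSet rs = sel.executeQuery()) {
            while (rs.next()) {
                batch.append(rs.getString("Body")).append(System.lineSeparator());
            }
        }
        out.write(batch.toString());   // one append to the file per poll, not one per message
        out.flush();
        try (PreparedStatement done = con.prepareStatement(
                "UPDATE MessageStage SET Status = 'PROCESSED' WHERE Status = 'IN_PROCESS'")) {
            done.executeUpdate();
        }
    }
}
```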
If FIFO is important, BizTalk has an ordered delivery mechanism (supported by the FILE adapter), but it comes at a performance cost.
A better solution would be to let each instance write to its own file and then have another scheduled process (or orchestration) combine them into one file. You can enforce FIFO using timestamps. This gives better performance and resource utilization than the singleton orchestration mentioned earlier. Another option is any suitable queue implementation.
You can move to a database system instead of a file. That would be a very simple solution and also very efficient.
If you don't want to go that way, you must implement file locking or a semaphore inside your application so that new threads wait for other threads to finish writing.

Any simple way that concurrent consumers can share data in Camel?

We have a typical scenario in which we have to load-balance a set of cloned consumer applications, each running on a different physical server. We should also be able to dynamically add more servers for scalability.
We were thinking of using round-robin load balancing here, but we don't want a long-running job on one server to cause a message to sit waiting in its queue for consumption.
To solve this, we thought of configuring 2 concurrentConsumers for each server application. When an older message is being processed by one thread and a new message arrives, the latter will be consumed from the queue by the second thread. While processing the new message, the second thread checks a class (global) variable shared by the threads. If it is 'ON', it can assume that the other thread is active (i.e. a job is already in progress) and re-routes the message back to its source queue. If the variable is 'OFF', it can start the job with the message data.
The jobs themselves are heavyweight, so we want only one job to be processed at a time. That's why the second thread re-routes the message if another thread is active.
So, the question is 'Any simple way that concurrent consumers can share data in Camel?'. Or, can we solve this problem in an entirely different way?
For a JMS broker like ActiveMQ you should be able to simply use concurrent listeners on the same queue. The broker does round-robin dispatch, but only to consumers that are idle, so basically this should just work. You may also need to set the prefetch size to 1, as prefetching might cause a consumer to take messages even while a long-running process would block them.
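A minimal sketch of that setup, assuming the camel-activemq component and an ActiveMQ broker (the broker URL, queue name, and job bean are placeholders, and the component package differs between Camel versions):

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.activemq.ActiveMQComponent;
import org.apache.camel.impl.DefaultCamelContext;

public class JobConsumer {
    public static void main(String[] args) throws Exception {
        DefaultCamelContext context = new DefaultCamelContext();

        // queuePrefetch=1 stops a busy consumer from hoarding the next message,
        // so the broker dispatches it to an idle consumer on another server.
        context.addComponent("activemq", ActiveMQComponent.activeMQComponent(
                "tcp://broker-host:61616?jms.prefetchPolicy.queuePrefetch=1"));

        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                from("activemq:queue:jobs?concurrentConsumers=2") // queue name is a placeholder
                    .to("bean:heavyJobProcessor");                // placeholder for the long-running job
            }
        });
        context.start();
        Thread.currentThread().join(); // keep the JVM alive for the consumers
    }
}
```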

solr optimize command and concurrency

I know that when Solr performs optimization, either explicitly via the optimize command or implicitly by Lucene due to the mergeFactor, readers are not blocked. That is, the server is still available for searching.
Is it also available for updates? Can other threads in my application send document updates to Solr, and possibly also send commits? Will those updates pass through into the index, or will they be blocked?
An old question, but some more info can help here.
The optimize command in Solr is a call to IndexWriter's forceMerge() method. This method does take a lock on the IndexWriter instance itself. However, the point is that adding documents does not require any lock on the IW instance, nor does it need the commitLock or fullFlushLock.
Moreover, even with forceMerge(), it is the ConcurrentMergeScheduler which picks up the merge process and does it in a different thread altogether.
Usually the merge process (not forceMerge, which is not recommended anyway) needs to lock the IndexWriter instance only while preparing the merge info - that is, while working out which segments to merge, what the new merged segment will be called, and so on. Once it has this information, the merge happens concurrently.
So, yes, you can keep adding documents even while optimize is in progress - they will be buffered in RAM until the next commit/optimize or close() of the IndexWriter.
Having said that, it is worth adding that you cannot have concurrent commits to different segments - that is, Lucene will do only one commit at a time. Adding documents does not flush them to any segment at all - it just puts them in a buffer.
The answer is "Yes". The server will respond to search requests, but updated documents will not show up in the search results until you send a commit command. The updated documents will stack up and be committed whenever a client/thread issues a commit command to the server. If you have multiple clients/threads issuing updates and commits, they will not block each other, and the updates will show up as soon as the commit command completes.
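A small SolrJ sketch of that behaviour (the core URL and field names are placeholders): documents added while a forceMerge is running are simply buffered and become searchable once the commit completes.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexWhileOptimizing {
    public static void main(String[] args) throws Exception {
        // Core URL is a placeholder.
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        // Kick off an optimize (forceMerge); the merge itself runs in background
        // threads on the server, so this call may take a while to return.
        new Thread(() -> {
            try { client.optimize(); } catch (Exception e) { e.printStackTrace(); }
        }).start();

        // Updates from other threads are not blocked: they are buffered
        // until a commit makes them visible to searchers.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "added while optimize is running");
        client.add(doc);
        client.commit();   // the new document now shows up in search results

        client.close();
    }
}
```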

What is the best way to do simultaneous file access on AS400?

I am trying to write a program that will be run in batch on the AS400. This program will write a record to a file to reflect its processing status: when it is first submitted it adds a record saying it is currently running, and when it is done it updates the same record to say it has finished. If I submit this program to batch multiple times, what is the best way to cope with this kind of simultaneous file access efficiently? I don't want a job to lock the whole file and stop others from updating it at the same time; it should lock only the record it needs and leave the rest to others. How can I achieve this? RPGLE or QMQRY? Or any other method?
RPG will not lock the entire file, only the record.
Personally, I'd recommend SQL for (pretty much) all file access, even through RPG. IBM hasn't been updating their Native I/O for a while, just concentrating on the SQL side of things.
Because, during normal use, record locks in RPG are released once the write or update has been performed, you should probably just run your SQL WITH NC (no commit). You need a way to tie a processing job to the data it's processing anyway (assuming things are long-running enough that the data lives in files outside of QTEMP) - you want to be able to pick up where you left off if your job dies, so you can't rely on holding the lock as a control mechanism. Don't forget that you're also going to need some sort of monitor job that can at least report the status, if not resubmit things - look at the QUSRJOBI API.
If you're doing this because you're using all Native I/O and processing huge sets of data (not huge, processor-intensive calculations), consider rewriting everything in SQL. Seriously. You can get far better performance - we've taken a process that used to run for 25+ hours down to one that runs in about 2.5.
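If it helps, the shape of the status-record statement looks like this; it is shown over JDBC with the jt400 driver purely for illustration (embedded SQL in RPGLE would use the same UPDATE), and the system, file, and column names are made up.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Sketch of keeping the per-job status record with SQL rather than Native I/O.
// Library/file/column names (MYLIB, JOBSTATUS, JOBNAME, STATUS) are made up.
public class JobStatus {
    public static void markRunning(String jobName) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:as400://my-ibm-i;libraries=MYLIB", "user", "password")) { // placeholder system
            // Row-level update; WITH NC runs without commitment control, so the
            // row lock is released as soon as the update completes.
            try (PreparedStatement ps = con.prepareStatement(
                    "UPDATE JOBSTATUS SET STATUS = 'RUNNING' WHERE JOBNAME = ? WITH NC")) {
                ps.setString(1, jobName);
                ps.executeUpdate();
            }
        }
    }
}
```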
