Camel SFTP file transfer, increasing performance - apache-camel

I'm using the Apache Camel SFTP component in my project, with a use case where we need to send up to 1000 small files (<50 KB each) to a remote SFTP server every second, but our code struggles to keep up over time. We keep the FTP connection open for reuse.
Is there any way to improve the file transfer performance by opening multiple connections to the same host, something like maxConnectionsPerHost or maxTotalConnections?

The FTP/SFTP producer does support multiple concurrent connections to the same target with the same setup. You can achieve this by creating multiple routes that point to the same target.
For example, create 10 folders and have 10 routes, each collecting from one folder and sending to the same target. Then have a route that collects files/messages from the source and distributes them evenly across the 10 folders.
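As a rough illustration of that layout in the Java DSL (the folder names, worker count and SFTP URI below are made-up placeholders, not taken from the question):

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.model.LoadBalanceDefinition;

// Sketch only: spread incoming files over N local folders, then let one
// route per folder push to the same remote SFTP host, giving N concurrent
// SFTP connections. All endpoint URIs are placeholders.
public class ParallelSftpRoutes extends RouteBuilder {

    private static final int WORKERS = 10;

    @Override
    public void configure() {
        // Distribute files from the source evenly over the worker folders.
        LoadBalanceDefinition balancer = from("file:data/inbox")
                .loadBalance().roundRobin();
        for (int i = 0; i < WORKERS; i++) {
            balancer.to("file:data/outbox" + i);
        }

        // One route (and one SFTP connection) per folder, same target host.
        for (int i = 0; i < WORKERS; i++) {
            from("file:data/outbox" + i)
                .to("sftp://user@remotehost/upload?password=RAW(secret)");
        }
    }
}
```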
Apart from using concurrent FTP/SFTP connections, switching to another protocol (e.g. HTTP or JMS) that naturally supports multiple concurrent connections at high speed is probably a better choice.

Related

Do websockets work with any data source such as DB2?

I'm starting to learn about websockets and I would like to know if they're supported by a database like DB2 (or some other data source).
Say I have a Spring Boot application that provides data to a UI as a service. Typically, I would run SQL SELECT statements every so many seconds from the Java application. However, I want a stream of the data in the table (or perhaps a stream of just the changes made to the table), similar to having an open websocket connection to a Kafka topic.
Is it possible to use something like a STOMP websocket to open a connection to a DB2 table that stays open and continuously pulls data? Does the data source have to support websockets for that to work?
No, they do not. RDBMS client-server protocols are more involved than just streaming a load of bytes for the client to interpret.
Having said that, database connections are already persistent, duplex, and stateful, and had been long before the WebSocket protocol was conceived.

libcurl use 1 connection for multiple concurrent requests

I would like to use 1 CURL handle and do, let's say, 10 concurrent requests with this handle. Would that be possible? The problem is that if I want, for example, 100 concurrent requests, it opens too many connections, and sometimes the server refuses to answer because too many connections are already open from the same IP. But if I had one handle and used it for many requests across multiple threads, that would probably solve the problem. Any idea if this is possible?
If you really want to do multiple requests in parallel on the same single connection, you need to use HTTP/2, and all those requests have to be made to the same host. That's an unusual situation though. You then need to ask libcurl to use HTTP/2 and you need to use the multi interface, like in the http2-download.c example.
If you have multiple URLs to different hosts and want to limit the number of connections used to transfer those, you can use the easy interface and get the URLs one by one to keep down the number of used connections - in combination with CURLOPT_MAXCONNECTS.
If you want to use the multi interface, you can still allow libcurl to do a limited amount of parallel transfers with CURLMOPT_MAX_TOTAL_CONNECTIONS and friends, even if you add a hundred easy handles at once. Or you can just limit the number of concurrently added easy handles.

Processing a million records as a batch in BizTalk

I am looking at suggestions on how to tackle this and whether I am using the right tool for the job. I work primarily on BizTalk and we are currently using BizTalk 2013 R2 with SQL 2014.
Problem:
We would be receiving positional flat files every day (around 50) from various partners, and the theoretical total number of records received would be over a million. Each record has some identifying information that needs to be sent to a web service, which essentially comes back with a YES or NO, based on which the incoming file is split into two files.
Originally, the scope for daily expected records was 10k, which later ballooned to 100k and is now at a million records.
Attempt 1: Scatter-Gather pattern
I am debatching the records in a custom pipeline using the file disassembler, adding a couple of port-configurable properties for the scatter part (following Richard Seroter's suggestion of implementing round-robin assignment), where I control the number of scatter/worker orchestrations I spin up to call the web service and mark the records to be sent to 'Agency A' or 'Agency B'. Finally, I push a control message that spins up the gather/aggregator orchestration, which collects all the messages processed by the workers from the MessageBox via correlation and creates the two files to be routed to Agency A and Agency B.
So, every file that gets dropped has its own set of workers and an aggregator that processes the file.
This works well for files with a smaller number of records, but if a file has over 100k records, I see throttling happen, and it takes a long time to process the file and generate the two output files.
I have put the receive location/worker & aggregator/send port on separate hosts.
It appears that the gatherer is dehydrated and does not really aggregate the records processed by the workers until all of them are processed, and I think that because the ratio of messages published vs. processed is very large, it is throttling.
Approach 2:
Assuming that the aggregator orchestration is the bottleneck, instead of accumulating the messages in an orchestration, I pushed the processed records to a SQL DB and 'split' the records into two XML files (basically concatenating the messages going to Agency A/B, wrapping them in an XML declaration, and using the correct message type, based on writing some of the context properties to the SQL table along with each record).
These aggregated XML records are polled and routed to the right agencies.
This seems to work okay with 100k records and completes in an acceptable amount of time. Now that the goal post/requirement has again changed with regard to expected volume, I am trying to see if BizTalk is even a feasible choice anymore.
I have indicated that BT is not the right tool for the job to perform such a task but the client is suggesting we add more servers to make it work. I am looking at SSIS.
Meanwhile, while doing some testing, some observations:
Increasing the number of workers improved processing (duh):
It looks like when each worker processes a smaller number of records in its queue/subscription, it finishes its queue quickly. When testing this 100k record file, 100 workers completed it in under 3 hours, with minimal activity on the server from other applications.
I am trying to get the web service hosting team to give me the theoretical maximum number of concurrent connections they can handle. I am leaning towards asking whether they can handle 1000 calls, and maybe the existing solution would scale, given my observations.
I have adjusted a few host settings with regard to message count and physical memory threshold so it won't balk at the volume, but I am still unsure. I didn't have to touch these settings before and could use advice on any particular counters to monitor.
The post is a bit long, but I hope this gives an idea of what I have done so far. Any help/insight in tackling this problem is appreciated. If you are suggesting alternatives, I am restricted to .NET or MS-based tools/frameworks, but would love to hear about other options as well.
I will try to answer or give more detail if you want to clarify or understand something I didn't make clear.
First, 1 million records/messages is not the issue, but you can make it a problem by handling it poorly.
Here's the pattern I would lay out first.
Load the records into SQL Server with SSIS. This will be very fast.
Process/drain the records into your BizTalk app for... well, whatever needs to be done: calling the service, etc.
Update the SQL Record with the result.
When that process is complete, query out the Yes and No batches as one (large) message each, transform and send.
My guess is the Web Service will be the bottleneck unless it's specifically designed for such a load. You will probably have to tune BizTalk to throttle only when necessary but don't worry about that just yet. A good app pattern is more important.
In such scenarios, you should consider the following approach:
De-batch the file and store the individual records in MSMQ. You can achieve this easily without any extra coding effort; all you need is a send port using the MSMQ adapter, or WCF-Custom with the netMsmqBinding. If required, you can also create separate queues depending on different criteria you may have in your messages.
Receive the messages from MSMQ using receive location on a separate host.
Send them to web service on a different BizTalk host.
Try to stick to messaging-only scenarios; you can handle the service response in a pipeline component if required. You can apply a map on the send port itself. In the worst case, if you need an orchestration, it should only handle the processing of a single message without any complex pattern.
You can then push the messages back to two MSMQ queues, one for each agency, based on the web service response.
You can then receive those messages again and write them to file: simply use a send port with the file append option, or use a custom pipeline component to write the received messages to file without aggregating them in an orchestration. You can gather them in an orchestration if you don't have more than a few thousand messages per file.
With this approach you won't have any bottleneck within BizTalk, and you don't need to use a complex orchestration pattern, which usually ends up creating many persistence points.
If the web service becomes a bottleneck, you can control the rate of messages received from MSMQ by 1) enabling Ordered Delivery on the MSMQ receive location and, if required, 2) using BizTalk host throttling: change the Message Count in Db property to a very low number (e.g. 1000 instead of the 50K default) and increase the Spool and Tracking Data Multipliers accordingly (e.g. 500 instead of the 10 default), so that the product of the two numbers is large enough not to cause throttling due to the messages within BizTalk. You can also reduce the number of worker threads on the BizTalk host to slow it down a little.
Please note that MSMQ is part of the Windows OS and does not require any additional setup. It is usually installed by default; if not, you can add it via Add/Remove Windows Features. You can also use IBM MQ if your organization has the infrastructure, but for one million messages, MSMQ will be just fine.
Apologies for the late update.
We've decided to use SSIS to bulk-import the file into a table. Since the lookup web service is part of the same organization and network (although it uses a different stack), they have agreed to let us query the lookup table their web service is based on, and we are using a merge between those tables to identify 'Y' or 'N' and export the results via SSIS as well.
In short, we've skipped using BizTalk. A 1.5 million record file is now processed, and the split files sent, within a couple of minutes.
Appreciate all the advice provided here.

Camel and source/target system availabilites strategy

I am new to Camel and am looking for patterns or strategies to manage the availability of a target system in a Camel route.
For example, say I want:
- to read input data from a file server
- process the data (data -> targetData)
- to send the target data (TargetData) to a target web site using Rest services (call it TargetSystem)
My question is: what is the best strategy if the TargetSystem is down?
I understand that if a route fails, it is possible to roll back the overall process. But since TargetSystem is an external system and can be down for hours, I don't think trying to roll back the process until the target system is up is a good approach.
Is there any pattern or strategy that fits well with this issue?
Regards
Gilles
This is the pattern I use with a couple of systems.
Persist your TargetData somewhere (Database table, JMS queue, Redis, ...)
Create a route that reads unsent TargetData and sends it to TargetSystem
If the transfer is OK, mark TargetData accordingly (e.g. set a flag in the table row, remove it from the queue, etc.)
Trigger such a route periodically from a timer and/or from other routes. You can even trigger it with a shell command to manually "force" a resend of unsent data.
You can customize this to your needs, for example adding logging where appropriate, keeping track in a DB table of when each send attempt occurred, how many times it failed, how many retries remain, and so on...
You can now modularize your application into 2 modules: one that receives Data and processes it into TargetData, and another that manages the transfer of TargetData to TargetSystem.
"Module" can mean 2 CamelContexts, 2 OSGi bundles, or 2 totally separate Java applications.
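A minimal Camel sketch of steps 2-4, assuming a target_data table with id, payload and sent columns and a REST endpoint for TargetSystem (all endpoint URIs, table and column names below are invented for illustration):

```java
import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;

// Sketch only: poll unsent rows, push each one to TargetSystem, and mark a
// row as sent only after the call succeeds. If TargetSystem is down, the
// rows simply stay unsent and are retried on the next timer tick.
public class ResendUnsentTargetData extends RouteBuilder {

    @Override
    public void configure() {
        from("timer:resendTargetData?period=300000") // every 5 minutes
            .to("sql:SELECT id, payload FROM target_data WHERE sent = 0"
                + "?dataSource=#myDataSource")
            .split(body())
                .setHeader("recordId", simple("${body[id]}"))
                .setBody(simple("${body[payload]}"))
                .setHeader(Exchange.HTTP_METHOD, constant("POST"))
                // REST call to TargetSystem; an exception here leaves the row unsent.
                .to("http://target-system.example.com/api/items")
                // Flag the row as transferred only after a successful call.
                .to("sql:UPDATE target_data SET sent = 1 WHERE id = :#${header.recordId}"
                    + "?dataSource=#myDataSource")
            .end();
    }
}
```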

Use SQL Server for file queue storage

We are developing a file processing system where several file-processing applications pick up files from a queue, process them, and put the files back on the queue as responses. Currently we use the Windows file system (a shared folder on the network) as the queue: we share one folder and put files in it, and the file-processing server applications pick up files from it and put them back after processing.
We are thinking of moving the whole queue engine from the Windows file system to SQL Server. Is it a good idea to store files in SQL Server and use SQL Server as the file queue backend? The files are about 1-20 MB in size and our system processes about 10,000 files per day.
You can do that, but I'd prefer a queue - either a remote instance or an in-memory object. I would prefer a real queue because I could pool listeners and have the queue hand off requests to them and manage their life cycle. You'll have to write all that code if you put them in a database.
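For illustration only, here is a bare-bones sketch of the listener-pool idea with an in-memory queue (a real broker would replace the BlockingQueue; class and method names are invented):

```java
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: a pool of listeners fed by a queue, so the queue hands off
// work and the pool manages listener life cycles.
public class FileWorkQueue {

    private static final int LISTENERS = 50;

    private final BlockingQueue<Path> queue = new ArrayBlockingQueue<>(1_000);
    private final ExecutorService pool = Executors.newFixedThreadPool(LISTENERS);

    public void start() {
        for (int i = 0; i < LISTENERS; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        Path file = queue.take(); // blocks until work arrives
                        process(file);            // the actual file processing step
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    public void enqueue(Path file) throws InterruptedException {
        queue.put(file); // producers hand files off to the pool
    }

    private void process(Path file) {
        // placeholder for the real processing logic
    }
}
```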
10,000 files per day means you need to process one every 8.64 seconds for 24 hours a day. What are your typical processing times for a 1-20MB file?
The processing of the files should be asynchronous.
If you have 50 listeners, each handling one 20MB file, your total memory footprint will be on the order of 1GB.
As far as speed goes, the worst case is the 15 minutes for processing time. That's four per hour, 96 per day. So you'll need at least 104 processors to get through 10,000 in a single day. That's a lot of servers.
You're not thinking about network latency, either. There's transfer time back and forth for each file. It's four network hops: one from the client to the database, another from the database to the processor, and back again. 20MB could introduce a lot of latency.
I'd recommend that you look into Netty. I'll bet it could help to handle this load.
The file size is quite nasty - unless you can e.g. significantly compress the files, the storage requirements in SQL might outweigh any benefits you perceive.
What you might consider is a hybrid solution: model each incoming file in SQL (fileName, timestamp, uploadedBy, processedYN, etc.), queue the file record in SQL after each upload, and then use SQL to do the queueing/ordering (and you can then run audits, reports, etc. out of SQL).
The downside of the hybrid is that if your file system crashes, you have SQL but not your files.
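A rough sketch of that hybrid shape, with invented table and column names (the file itself stays on the share; only its metadata goes into SQL Server):

```java
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch only: queue file metadata in SQL Server while the file stays on disk.
public class FileQueueDao {

    public void enqueue(Connection con, Path file, String uploadedBy) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO FileQueue (FileName, UploadedAt, UploadedBy, Processed) "
              + "VALUES (?, SYSUTCDATETIME(), ?, 0)")) {
            ps.setString(1, file.toString());
            ps.setString(2, uploadedBy);
            ps.executeUpdate();
        }
    }

    // Peek at the oldest unprocessed entry; a real implementation would claim it
    // inside a transaction (e.g. with UPDLOCK/READPAST hints) so workers don't clash.
    public String nextPending(Connection con) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT TOP 1 FileName FROM FileQueue WHERE Processed = 0 ORDER BY UploadedAt");
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getString(1) : null;
        }
    }
}
```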
