I am using Apache Camel to send a file from my local machine to an SFTP server. The file I am testing with is 100 MB, and I observed a sudden CPU spike and a lot of heap memory being consumed. Can you help me improve the performance here?
Is the file content being read into memory? If so, how can I avoid that?
I don't want to process the file at all; I am only using this library to push the file to the server.
This is how I configured my route:
from("file://tmp?delete=true&localWorkDirectory=/tmp&antInclude=*.csv&readLock=changed&readLockTimeout=70000&readLockCheckInterval=1000")
.onException(Exception.class)
.log("something went wrong")
.end()
.onCompletion()
.log("file successfully delivered ${file:name}")
.end()
.log("processing file ${file:name}")
.to("sftp://127.0.0.1:22/test/in?jschLoggingLevel=INFO&password=xxxxxx&useUserKnownHostsFile=false&username=test");
jconsole output
Java is a garbage-collected language, so if plenty of memory is available, the garbage collector will not be eager to reclaim unused memory. Memory is only freed when the garbage collector runs.
If you want evidence that Camel delivers the file via a stream, and does not load the entire file contents into memory at once, cap the heap with the -Xmx JVM option, ideally at a value smaller than the file you are trying to send, and confirm the transfer still succeeds.
Example:
java -Xmx100m -jar YourJarFile.jar
Related
Hi all,
I have a problem with the Camel component I am developing: I'm not sure how to implement it in a way that is in line with Camel's concepts.
The producer I'm developing talks to the HTTP API of our server, which is used to send messages with attachments.
Those attachments can potentially be very large, which is why the server expects the total file size before any upload starts.
Currently the producer only accepts io.Files, nio.Paths and GenericFiles, because with those I can read the file size before I upload the file.
Of course this is not a good way to do things, because it requires the (big) file to be available locally.
Connecting, for example, an FTP server as the consumer would mean that I have to download each file locally just so I can upload it afterwards.
The obvious solution is to use streams to access and upload the data, but then I do not know how big the file is until the upload is done, which is not an option; I need the size in advance.
My question is: what are the best practices for streaming files through Camel while still having the consumer give me the file size in advance?
Greets
Chris
For the File/FTP consumer, the exchange carries a header with the key CamelFileLength (Exchange.FILE_LENGTH), which holds the file size on the remote FTP server as reported by the consumer's directory scan.
Unlike a size obtained locally, the value in CamelFileLength might differ from the actual number of bytes your application receives:
ASCII transfer mode can change the line endings when the operating systems differ.
The file size might change between the consumer's scan and the moment it actually picks the file up.
I have a situation where I need to spawn worker processes. On one side, worker processes should read evenly split parts of a file and pass the data over a socket connection; the other side should read that data and write it in parallel. I plan to split the source file into parts beforehand, so that each process gets exactly one part of the file to read from or write to.
I'm already using sockets with read/write, so I think it is best to keep using this simple API. But I cannot find any means of setting the file offset when working with plain file descriptors, which I obviously need when reading from a file that is divided into read/write parts.
I've heard that mmap can help me somehow, but to my understanding mmap needs a lot of RAM, and my app will run many of these transfers at once. The app is also quite limited in CPU usage.
The question is: which API should I use?
EDIT: I am on Linux; the filesystem is ext4.
I am developing a system that copies and writes files on NTFS inside a virtual machine. At any time the VM can be powered off (a hard shutdown). The poweroff is controlled from outside, so I have no way to detect it. Because of that, files, and even complete directories that are being written to, get lost. Is there any way to prevent this, or do I have to develop my own file system? I have to store the files on the local disk and cannot send them over the network.
There is always a [short] period between when your data is written (handed to the API) and when it is actually written to the physical hardware. If the system crashes in that window, the data is lost.
There is a setting in Windows to disable the system write cache for certain disks. This setting can help you ensure that the data is at least sent to the host's hardware; that is probably the answer you have been looking for.
Writing your own filesystem won't help much, because it is mainly the write cache that causes the data loss. A filesystem-level cache can exist as well, though, and I don't know whether the write-cache setting mentioned above also affects an internal filesystem cache.
If you write data to a file opened with write-through enabled, the write call only returns after the data has physically reached the disk, so you can be sure it was written. You normally do that by passing the FILE_FLAG_WRITE_THROUGH flag when you open the file.
This means, for example, a module can start compressing the response from a backend server and stream it to the client before the module has received the entire response from the backend.
Nice!
I know it's some kind of asynchronous IO, but that alone doesn't explain it.
Does anyone know?
Without looking at the source code of an actual implementation, I'm speculating here:
It's most likely some kind of stream (abstract buffered IO) that is passed from one module to the other ("chaining"). One module (maybe a servlet container) writes to a stream that is read by another module (the compression module in your example), which then writes its output to another stream. The contents of that stream may then be processed further or transmitted to the client.
The backend may need to wait on IO before it can fully produce the page. Modules can begin compressing the start of the message before the backend is entirely done writing it.
To understand why this is useful, you need to understand how nginx is structured. nginx is a server that relies on non-blocking input and output. Normally, a server uses blocking input and output: it listens on a connection, and when a connection arrives, it processes the page. To increase throughput, multiple threads are spawned, called 'workers'.
Contrast this with nginx: it continually asks the kernel, "Are any of my IO requests ready?" This allows it to handle the same number of pages with (1) less overhead from all the separate processes and (2) lower memory usage. It has some downsides, however. For extremely low-volume applications, nginx may use more CPU than a blocking server. Second, it is much less portable: Windows uses an entirely different model for non-blocking IO.
Getting back to your original question: compressing the beginning of a page is useful because that compressed prefix can already be on its way to the client while the rest of the page is still waiting on a database query, a disk read, or whatever else.
I wonder what the most efficient file-logging strategy would be in a server written in C.
I can see the following options:
fopen() in append mode, then fwrite() the data for a time frame of, say, 1 hour, then fclose()?
Caching the data and then occasionally open() for append, write(), and close()?
Using a separate thread is usually a good solution; we adopted it with interesting results.
The main thread that needs to log prepares the log string and passes it to a second thread. To feed the second thread we use a lockless queue plus a circular buffer, in order to minimize the amount of alloc/free and the wait time.
The second thread waits on the lockless queue. When it finds there is some work to do, it consumes a slot of the queue and logs the data.
Using a separate thread you can save a great amount of time.
After we decided to use a second thread, we had to face another problem. Many instances of the same program (a full-text search engine) must all log to the same file, so the resource has to be shared cleanly among every instance of the server.
We could have used a semaphore or another synchronization method, but we found a different solution: the second thread sends a UDP packet to a local log server that listens on a known port. This server reads each message and logs it to the file (the server is the only process that owns the file while it is being written). The UDP socket itself serializes the logs.
I have been using this solution for more than 10 years and have never lost a single line of my log files; with the second thread I also saved a great percentage of time on every operation (we log a lot of information for every single command the server receives).
HTH
Why don't you log your data directly when the events occur?
If your server crashes, you want to retrieve the data from right up to the moment it crashed. If you only flush your buffered logs once an hour, you will miss the most interesting logs.
File streams are usually buffered by the OS anyway.
If you believe logging makes your server slow because of hard-drive writes, you might consider logging in a separate thread. But I wonder whether that is really the problem. Premature optimization?
Unless you've benchmarked and found that it's a bottleneck, use fopen and fprintf. There's no reason to put your own complex buffering layer on top unless stdio is too slow for you (and if it is too slow, you might want to rethink the OS/C library your server is running on).
The slowest part of writing a system log is the output operation to the physical disks.
Buffering and checksumming the log records are necessary to ensure that you don't lose any log data and that the log data can't be tampered with after the fact, respectively.