Processing large compressed files in apache camel - apache-camel

I am trying to get a single file with .zip compression from a ftp server and trying to store it in S3 with .gzip compression using camel.
Following is the route I currently have.
from("sftp://username#host/file_path/?password=<password>&noop=true&streamDownload=true")
.routeId("route_id")
.setExchangePattern(ExchangePattern.InOut)
.unmarshal().zipFile()
.marshal().gzip()
.to("aws-s3://s3_bucket_name?amazonS3Client=#client");
This works fine for smaller files. But I have files that are ~700 MB in size when compressed. For files of that size I get OutOfMemoryError for Java heap space
I know there is a streaming option in camel (.split(body().tokenize("\n")).streaming()) but I am not sure if I can umarshal and marshal while streaming. (I see a similar solution here but in this case the source file is plain text / csv).
The second part to the problem is streaming the file back to S3. I am aware of the multiPartUpload option in camel-aws component but it seems to require the source to be a file. I do not know how to achieve that.
Can this be achieved without processing (unzipping and then gzipping) the file using java code in a custom processor ?
Environment: Camel 2.19.3, Java 8
Thanks

I solved it using streamCaching(). So the way I would do that is
from('xyz')
.streamCaching()
.unmarshall().gzip()
.to('abc')
.end()

Related

How should I go about streaming large local video files in React?

I've recently taken up leaning some more JS and started working on a video player using React.But I am stumped on this issue I am having; I can not import files larger than 2gb and I cannot find any information on streaming the files from local directories, every doc I find either uses smaller than 2gb files, or uses remote files, like a w3.org link.
How should I go about streaming said files. Should I be using HLS, DASH, or something else?
As an example:
import video from './content/filename.mp4';
This will throw the error saying:
RangeError [ERR_FS_FILE_TOO_LARGE]: File size (filesize) is greater
than 2 GB
Making it impossible for me to playback in the default video player or something like video-react or videojs. The one solution I can think up is splitting the files into chunks, but I can't seem to find any reliable ways to do that.
In normal html and js, the same files load normally (I dynamically load the src based on the query parameters) and play in the default video player, so I assume this is a Webpack limit to avoid crashing the browser.
Thanks in advance!

Reading files from S3 bucket with camel

I am new to Camel and need some guidance. I need to read some files from an S3 bucket. The structure is like so.
S3 Bucket
```
Incoming
+xls
-file1.xls
-file2.xls
-file3.xls
+doc
-file1.doc
-file2.doc
-file3.doc
Processed
+xls
...
+doc
...
When a particular excel file is dropped into the incoming/xls folder (say file1.xls), I need to pick up all the files, do some processing and drop them into a processed folder with the same directory structure.
What components do I need to use for this? I tried reading the documentation but its a little difficult to figure out what components I need. I understand that I will use the camel-aws-s3 plugin but there are not many examples of it out there.
On the https://camel.apache.org/components/latest/aws-s3-component.html there some examples about writing and reading from a S3 Bucket.
Next to reading and writing to S3, you might need some custom processor that uses Apache POI to transform the xsl files

How to write to different files based on content for batch processing in Flink?

I am trying to process some files on HDFS and write the results back to HDFS too. The files are already prepared before job starts. The thing is I want to write to different paths and files based on the file content. I am aware that BucketingSink(doc here) is provided to achieve this in Flink streaming. However, it seems that Dataset does not have a similar API. I have found out some Q&As on stackoverflow.(1, 2, 3). Now I think I have two options:
Use Hadoop API: MultipleTextOutputFormat or MultipleOutputs;
Read files as stream and use BucketingSink.
My question is how to make a choice between them, or is there another solution ? Any help is appreciated.
EDIT: This question may be a duplicate of this .
We faced the same problem. We too are surprised that DataSet does not support addSink().
I recommend not switching to Streaming mode. You might give up some optimizations (i.e Memory pools) that are available in batch mode.
You may have to implement your own OutputFormat to do the bucketing.
Instead, you can extend the OutputFormat[YOUR_RECORD] (or RichOutputFormat[]) where you can still use the BucketAssigner[YOUR_RECORD, String] to open/write/close output streams.
That's what we did and it's working great.
I hope flink would support this soon in Batch Mode soon.

Processing a large (>32mb) xml file over appengine

I'm trying to process large (~50mb) sized xml files to store in the datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even straight up uploading the file within my source code, but again keep running into limits (i.e. the 32 mb limit).
So, I'm really confused (and a little angry/frustrated). Does appengine really have no real way to process a large file? There does seem to be one potential work around, which would involve remote_apis, amazon (or google compute I guess) and a security/setup nightmare...
Http ranges was another thing I considered, but it'll be painful to somehow connect the different splitted parts together (unless I can manage to split the file at exact points)
This seems crazy so I thought I'd ask stackover flow... am I missing something?
update
Tried using range requests and it looks like the server I'm trying to stream from doesn't use it. So right now I'm thinking either downloading the file, hosting it on another server, then use appengine to access that via range http requests on backends AND then automate the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)
What about storing it in the cloud storage and reading it incrementally, as you can access it line by line (in Python anyway) so it wont' consume all resources.
https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
https://developers.google.com/storage/
The GCS client library lets your application read files from and write
files to buckets in Google Cloud Storage (GCS). This library supports
reading and writing large amounts of data to GCS, with internal error
handling and retries, so you don't have to write your own code to do
this. Moreover, it provides read buffering with prefetch so your app
can be more efficient.
The GCS client library provides the following functionality:
An open method that returns a file-like buffer on which you can invoke
standard Python file operations for reading and writing. A listbucket
method for listing the contents of a GCS bucket. A stat method for
obtaining metadata about a specific file. A delete method for deleting
files from GCS.
I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.
def read_file(self, filename):
self.response.write('Truncated file content:\n')
gcs_file = gcs.open(filename)
self.response.write(gcs_file.readline())
gcs_file.seek(-1024, os.SEEK_END)
self.response.write(gcs_file.read())
gcs_file.close()
Incremental reading with standard python!

How to move large number of files files on disk to HDFS sequence files

I want to move large number of small files to HDFS sequence file(s). I have come across two options:
Use Flume. Flume does not have a built in file source and this requires a custom source to push the files.
Use apache camel file to hdfs route.
Even though the above two methods serve the purpose, I would like to weigh other options available before picking one. In particular i am interested in a solution that is more configurable and results in less maintainable code.
Use Flume. Flume does not have a built in file source and this requires a custom source to push the files.
Umm... no, that's not right. Flume has a Spooling Directory Source which would do the high level of what you want.
Seems like a few lines of code with Camel. i.e. from("file:/..").to("hdfs:..") plus some init and project setup.
Not sure how much easier (less lines of code) you can do it using any method.
If the HDFS options in Camel is enough for configuration and flexibility, then I guess this approach is the best. Should take you just a matter of hours (or even minutes) to have some test cases up and running.

Resources