Reading file content using file name from db in camel - apache-camel

I have to run a query against a db to get filenames from a table, and then I have to read the contents of the files in a folder/directory using the file names I got from the query. I have done the query part and stored the list of filenames in the Exchange using a bean, but I am wondering how I can use these filenames from the exchange to read the file contents. Could you please help?

You can use the Content Enricher pattern (http://camel.apache.org/content-enricher.html) together with dynamic endpoints (Camel 2.16+) to load the contents of a file whose path was previously obtained from the database.
UPDATED
You have to use pollEnrich (because the file component is a polling consumer) to consume files from the URI, and you can use an expression (such as Simple) to configure the fileName.
You can try something like this (only for Camel versions 2.16 and newer):
.pollEnrich("file:?fileName=${header.FILE_NAME}", 1000, new YourAggregationStrategy())
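A minimal sketch of the whole flow, assuming the query puts each file name into a header called FILE_NAME and the files live under /data/inbox (the SQL endpoint, directory, and header name here are hypothetical; this goes inside a RouteBuilder.configure()):
from("direct:readFiles")
    .to("sql:select file_name from file_table")                   // or your existing bean that queries the DB
    .split(body())                                                 // one exchange per returned row
        .setHeader("FILE_NAME", simple("${body[file_name]}"))
        .pollEnrich("file:/data/inbox?fileName=${header.FILE_NAME}&noop=true",
                    1000, new YourAggregationStrategy())
        .log("Read file ${header.FILE_NAME}: ${body}")
    .end();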

Related

How to set source path dynamically to read files from different folders in flink?

I want to read files every day, so the path should always be set to the current date. How can I do that in Flink?
I think you can either generate the path of the file you expect to exist, and then read it, or set up a streaming job that uses file discovery to ingest files as they become available.
See the docs for FileSource and readFile. readFile is the legacy connector for ingesting files; FileSource was introduced in Flink 1.12.
fileInputFormat.setFilesFilter(filter);
I decided to use this, but I am not sure it is best practice.
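For the file-discovery approach mentioned above, a minimal sketch with FileSource (assuming Flink 1.15+, the flink-connector-files module on the classpath, and a hypothetical base path) could look like this; the source keeps re-scanning the directory, so files that show up each day are picked up automatically:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

FileSource<String> source = FileSource
    .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://my-bucket/data/"))
    .monitorContinuously(Duration.ofSeconds(30))   // re-scan the path every 30 seconds
    .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "file-source")
   .print();
env.execute("daily-file-ingest");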

Consume multiple text files with Apache Flink DataSet API

I am writing a batch job with Apache Flink using the DataSet API. I can read a text file using readTextFile(), but this function just reads one file at a time.
I would like to consume all the text files in my directory and process them one by one, in the same job with the DataSet API, if that is possible.
Another option is to implement a loop running multiple jobs, one per file, instead of one job with multiple files, but I don't think that is the best solution.
Any suggestion?
If I got the documentation right, you can read an entire path (directory) using ExecutionEnvironment.readTextFile(). You can find an example here: Word-Count-Batch-Example
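A minimal sketch, assuming the text files sit directly under a hypothetical directory /data/input; pointing readTextFile() at the directory reads every file inside it into one DataSet:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> lines = env.readTextFile("/data/input");   // reads all files in the directory
lines.map(String::toUpperCase)
     .print();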
References:
Flink Documentation
Flink Sources

Processing large compressed files in apache camel

I am trying to get a single file with .zip compression from an FTP server and store it in S3 with .gzip compression using Camel.
Following is the route I currently have.
from("sftp://username#host/file_path/?password=<password>&noop=true&streamDownload=true")
.routeId("route_id")
.setExchangePattern(ExchangePattern.InOut)
.unmarshal().zipFile()
.marshal().gzip()
.to("aws-s3://s3_bucket_name?amazonS3Client=#client");
This works fine for smaller files, but I have files that are ~700 MB in size when compressed. For files of that size I get an OutOfMemoryError for Java heap space.
I know there is a streaming option in Camel (.split(body().tokenize("\n")).streaming()) but I am not sure if I can unmarshal and marshal while streaming. (I have seen a similar solution, but in that case the source file was plain text/CSV.)
The second part of the problem is streaming the file back to S3. I am aware of the multiPartUpload option in the camel-aws component, but it seems to require the source to be a file. I do not know how to achieve that.
Can this be achieved without processing (unzipping and then gzipping) the file using Java code in a custom processor?
Environment: Camel 2.19.3, Java 8
Thanks
I solved it using streamCaching(). So the way I would do that is:
from("xyz")
    .streamCaching()
    .unmarshal().gzip()
    .to("abc");
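Applied to the route from the question, that suggestion would look roughly like the sketch below (endpoints and options copied from the question; with stream caching enabled, large message bodies are spooled to disk once they exceed the configured threshold instead of being held entirely in the Java heap):
from("sftp://username@host/file_path/?password=<password>&noop=true&streamDownload=true")
    .routeId("route_id")
    .streamCaching()
    .unmarshal().zipFile()
    .marshal().gzip()
    .to("aws-s3://s3_bucket_name?amazonS3Client=#client");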

How to implement file search functionality?

I am working on file search engine functionality and I need your suggestions on designing my application.
I am using Elasticsearch as the framework to implement my functionality.
My primary feature is to enable file search based on file name, file type, size, and date of creation. I also need to enable searching based on the content of the files.
Please suggest the best possible way to do the indexing and extract the file data.
Also, since files can be deleted/updated, I would need to regenerate the index at some interval, so how can I monitor changes in the directory?
I am using SAMBA as my file storage system.
To search on file content, you need to index the files into an Elasticsearch index.
Look into the Mapper Attachments plugin; it will help you index the files and make them searchable.
Step 01: install the plugin into your Elasticsearch cluster.
Step 02: convert the files to byte[] and send them to the Elasticsearch index.
Step 03: now you can search the file content using normal queries.
Note: this will only work for text-based files like PDF, Word (doc, docx) and plain text. If a PDF contains text inside images, that text will not be searchable.
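A minimal sketch of Steps 02 and 03 with the pre-5.x Java client, assuming Elasticsearch 2.x with the mapper-attachments plugin installed and an index "files" whose "content" field is mapped as type "attachment" (index, type, field names, and the file path are hypothetical):
// 'client' is an existing org.elasticsearch.client.Client (e.g. a TransportClient)
byte[] bytes = Files.readAllBytes(Paths.get("/share/docs/report.pdf"));
String encoded = Base64.getEncoder().encodeToString(bytes);

XContentBuilder doc = XContentFactory.jsonBuilder()
    .startObject()
        .field("file_name", "report.pdf")
        .field("file_size", bytes.length)
        .field("content", encoded)          // the attachment field takes base64-encoded file data
    .endObject();

client.prepareIndex("files", "doc", "report.pdf").setSource(doc).get();

// Step 03: a normal match query then searches the extracted text
SearchResponse response = client.prepareSearch("files")
    .setQuery(QueryBuilders.matchQuery("content", "quarterly revenue"))
    .get();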

Play! + GAE + File Upload

Usually with the Play framework, when you upload a file, it appears as a File object to the controller, and the file itself is stored in a tmp folder. In GAE this won't work because GAE does not allow writing to the filesystem.
How would one upload a file and access the stream directly in the controller?
So I figured out the solution. In the controller, instead of taking a File parameter, you just take a byte[], and use a ByteArrayInputStream to get that into a more usable form. In my case I needed to pass the file data to a CSV parser which takes an InputStream.
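A minimal sketch in the Play 1.x style of that era (the controller, action, and parameter names are hypothetical; the byte[] parameter name must match the name of the file field in the upload form):
public class Uploads extends Controller {
    public static void upload(byte[] data) {
        // Wrap the uploaded bytes so APIs expecting an InputStream (e.g. a CSV parser) can read them
        InputStream in = new ByteArrayInputStream(data);
        // ... hand 'in' to the CSV parser here ...
        renderText("received " + data.length + " bytes");
    }
}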
I'm not familiar with the Play framework either, but generally, for multipart requests (e.g. file uploads):
the data from the input stream is written to a temporary file on the local filesystem if the input is large enough;
the request is then dispatched to your controller;
your controller gets a File object from the framework (this File object points to the temporary file).
For Apache Commons FileUpload, you can use the DiskFileItemFactory to set the size threshold at which the framework decides whether to write the file to disk or keep it in memory. If kept in memory, the data is held in an in-memory buffer; this is handled transparently, so your servlet code works the same way whether the upload ended up on disk or in memory. See the sketch after this answer.
Perhaps there is a similar configuration for the Play framework.
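A minimal sketch of that threshold with Apache Commons FileUpload in a servlet (the threshold value and temp directory are hypothetical; 'request' is the incoming HttpServletRequest):
// Uploads up to 256 KB stay in memory; larger ones are written to the repository directory
DiskFileItemFactory factory = new DiskFileItemFactory();
factory.setSizeThreshold(256 * 1024);
factory.setRepository(new File("/tmp/uploads"));

ServletFileUpload upload = new ServletFileUpload(factory);
List<FileItem> items = upload.parseRequest(request);
for (FileItem item : items) {
    if (!item.isFormField()) {
        InputStream in = item.getInputStream();   // works whether the data is in memory or on disk
        // ... process the stream ...
    }
}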
