Akka stream each element to ftp sink - akka-stream

I want to write each element in an Akka stream to a (different) FTP file. Using Alpakka I can write every element to the same file with an FTP sink, but I cannot figure out how to write each element to a different file.
source.map(el -> /* to byte string */).to(Ftp.toPath("/file.xml", settings));
So every el should end up in a different file.

If you want to use the Alpakka FTP sink, you have to do something along the lines of
def sink(name: String): Sink[ByteString, Future[IOResult]] = Ftp.toPath(s"$name.txt", settings)
source.runForeach(s ⇒ Source.single(ByteString(s)).runWith(sink(s)))
Otherwise, you'll need to write your own sink: a custom graph stage that opens the FTP connection and writes the data in its input handler. More about custom graph stages can be found in the docs.
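For completeness, here is roughly the same idea expressed with the Java DSL used in the question. This is only a minimal sketch, assuming Alpakka's javadsl Ftp connector and an Akka 2.6-style ActorSystem; the host name, element values and file naming are placeholders:

import akka.NotUsed;
import akka.actor.ActorSystem;
import akka.stream.IOResult;
import akka.stream.alpakka.ftp.FtpSettings;
import akka.stream.alpakka.ftp.javadsl.Ftp;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;
import akka.util.ByteString;

import java.util.Arrays;
import java.util.concurrent.CompletionStage;

public class PerElementFtpSink {
    public static void main(String[] args) {
        final ActorSystem system = ActorSystem.create("per-element-ftp");
        // Placeholder settings: point this at your own FTP host/credentials.
        final FtpSettings settings = FtpSettings.create("ftp.example.com");

        final Source<String, NotUsed> source = Source.from(Arrays.asList("a", "b", "c"));

        // One sub-stream per element: each element is wrapped in its own
        // single-element Source and run against a sink for a different remote path.
        source.runForeach(el -> {
            Sink<ByteString, CompletionStage<IOResult>> sink =
                    Ftp.toPath("/" + el + ".xml", settings);
            Source.single(ByteString.fromString(el)).runWith(sink, system);
        }, system);
    }
}

Note that each inner runWith materializes a new FTP connection, so for a high-volume source you may want to bound concurrency, e.g. by using mapAsync over the materialized CompletionStage instead of runForeach.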

Related

Read and parse all files inside S3 paths from Kafka with Flink

I am reading the S3 URL from a Kafka producer, then going to that S3 URL to process all files inside that folder. After grabbing the info from each file, I will pass that data to a sink.
Initially, I have a DataStream<String> that reads and grabs the nested JSON value from the Kafka source's ObjectNode, using the JSONKeyValueDeserializationSchema. So the path exists as a String inside the DataStream. How do I pass this string to a FileSource? The FileSource object takes in a Path object for the location of the folder.
I'm planning to use FileSource.forRecordStreamFormat to go through all the files and then all the lines of each file. However, this produces a FileSource<String>, and then a DataStream<String> by calling env.fromSource.
The example I'm looking at now is: https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/sideoutput/SideOutputExample.java
I see that FileSource takes in a Path object and then eventually gets a DataStream<String> but is there a way for me to grab that String from the initial Kafka source DataStream<String> and then use it for a FileSource?
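For reference, a minimal sketch of the fixed-path FileSource usage described above (assuming Flink 1.15+ and the flink-connector-files dependency; the bucket path is a placeholder). The open question is how to replace the hard-coded Path with the string arriving from the Kafka-backed DataStream<String>:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FileSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The folder path is hard-coded here; the question is how to take it
        // from the Kafka-provided DataStream<String> instead.
        FileSource<String> fileSource = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://myBucket/input/"))
                .build();

        DataStream<String> lines =
                env.fromSource(fileSource, WatermarkStrategy.noWatermarks(), "s3-files");

        lines.print();
        env.execute("file-source-sketch");
    }
}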

How to use a PipedInputStream in a Producer Endpoint in Apache Camel

I am using Camel to consume a huge (GB) file, process/modify some data in the file, and finally forward the modified file via an AWS-S3, FTP or SFTP producer endpoint to its target. In the actual usage scenario, using an intermediate (temporary) file to hold the processed data is not allowed.
In case of the AWS producer, the configure method of the corresponding RouteBuilder specifies the route as follows:
from("file:/...")
.streamCaching()
.process(new CustomFileProcessor())
.to("aws-s3://...");
In its process(Exchange exchange) method, the CustomFileProcessor reads the input data from exchange.getIn().getBody(InputStream.class) and writes the processed and modified data into a PipedOutputStream.
Now the PipedInputStream connected with this PipedOutputStream should be used as the source for the producer sending the data to AWS-S3.
I tried exchange.getOut().setBody(thePipedInputStream) in the process method but this doesn't work and seems to create a deadlock.
So what is the correct way - if it is possible at all - of piping the processed output data of the CustomFileProcessor to the producer endpoint so that the entire data is sent over?
Many thanks in advance.
After further digging into this, the solution was quite simple: I only needed to place the pipe reader in a separate thread and the problem was solved.
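A minimal sketch of that arrangement (the class name, buffer handling and error handling are illustrative, not the poster's actual code); the essential point is that the two ends of the pipe run on different threads, so neither side blocks the other on the pipe's internal buffer:

import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

import org.apache.camel.Exchange;
import org.apache.camel.Processor;

public class CustomFileProcessor implements Processor {

    @Override
    public void process(Exchange exchange) throws Exception {
        final InputStream in = exchange.getIn().getBody(InputStream.class);

        final PipedOutputStream out = new PipedOutputStream();
        final PipedInputStream pipedIn = new PipedInputStream(out);

        // The transforming loop that feeds the pipe runs on its own thread;
        // otherwise the route's thread would block as soon as the pipe buffer fills up.
        Thread writer = new Thread(() -> {
            byte[] buf = new byte[8192];
            try {
                int n;
                while ((n = in.read(buf)) != -1) {
                    // ... modify the chunk here before forwarding it ...
                    out.write(buf, 0, n);
                }
            } catch (Exception e) {
                // Error propagation is left out of this sketch.
            } finally {
                try { out.close(); } catch (Exception ignore) { }
            }
        }, "custom-file-processor-writer");
        writer.start();

        // The producer endpoint then streams from the reading end of the pipe.
        exchange.getOut().setBody(pipedIn); // getMessage() on Camel 3
    }
}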

Akka-Stream stream within stream

I am trying to figure out how to handle a situation where, in one of your stages, you need to make a call that returns an InputStream, and then treat that stream as a Source for the stages that come further down.
e.g.
source.map(e => /* call that returns an InputStream */)
.via(processingFlow).runWith(Sink.ignore)
I would like the elements going into the processing flow to be those coming from the InputStream. This is basically a situation where I am tailing a file and reading each line; the line gives me the information about a call I need to make against a CLI API, and when making that call I get the stdout as an InputStream from which to read the result. Results are going to be huge most of the time, so I can't just collect the whole thing in memory.
You can use the StreamConverters utilities to get Sources and Sinks from java.io streams. More info here.
You can use flatMapConcat or flatMapMerge to flatten a stream of Sources into a single stream. More info here.
A quick example could be:
val source: Source[String, NotUsed] = ???
def gimmeInputStream(name: String): InputStream = ???
val processingFlow: Flow[ByteString, ByteString, NotUsed] = ???
source
.map(gimmeInputStream)
.flatMapConcat(is ⇒ StreamConverters.fromInputStream(() ⇒ is, chunkSize = 8192))
.via(processingFlow)
.runWith(Sink.ignore)
However Akka Streams offers a more idiomatic DSL to read/write files in the FileIO object. More info here.
The example becomes:
val source: Source[String, NotUsed] = ???
val processingFlow: Flow[ByteString, ByteString, NotUsed] = ???
source
.flatMapConcat(name ⇒ FileIO.fromPath(Paths.get(name)))
.via(processingFlow)
.runWith(Sink.ignore)

Hadoop Map Whole File in Java

I am trying to use Hadoop in Java with multiple input files. At the moment I have two files, a big one to process and a smaller one that serves as a sort of index.
My problem is that I need to keep the whole index file unsplit while the big file is distributed to each mapper. Is there any way provided by the Hadoop API to do such a thing?
In case I have not expressed myself correctly, here is a link to a picture that represents what I am trying to achieve: picture
Update:
Following the instructions provided by Santiago, I am now able to insert a file (or the URI, at least) from Amazon's S3 into the distributed cache like this:
job.addCacheFile(new Path("s3://myBucket/input/index.txt").toUri());
However, when the mapper tries to read it, a 'file not found' exception occurs, which seems odd to me. I have checked the S3 location and everything seems to be fine. I have used other S3 locations to introduce the input and output files.
Error (note the single slash after the s3:)
FileNotFoundException: s3:/myBucket/input/index.txt (No such file or directory)
The following is the code I use to read the file from the distributed cache:
URI[] cacheFile = output.getCacheFiles();
BufferedReader br = new BufferedReader(new FileReader(cacheFile[0].toString()));
while ((line = br.readLine()) != null) {
//Do stuff
}
I am using Amazon's EMR, S3 and the version 2.4.0 of Hadoop.
As mentioned above, add your index file to the Distributed Cache and then access it in your mapper. Behind the scenes, the Hadoop framework will ensure that the index file is sent to all the task trackers before any task is executed and will be available for your processing. In this case, the data is transferred only once and will be available for all the tasks related to your job.
However, instead of adding the index file to the Distributed Cache in your mapper code, make your driver class implement the Tool interface (run through ToolRunner) and override the run method. This provides the flexibility of passing the index file to the Distributed Cache through the command prompt while submitting the job.
If you are using ToolRunner, you can add files to the Distributed Cache directly from the command line when you run the job. No need to copy the file to HDFS first. Use the -files option to add files:
hadoop jar yourjarname.jar YourDriverClassName -files cachefile1,cachefile2,cachefile3,...
You can access the files in your Mapper or Reducer code as below:
File f1 = new File("cachefile1");
File f2 = new File("cachefile2");
File f3 = new File("cachefile3");
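A minimal sketch of such a driver (the class name, job name and argument handling are placeholders). ToolRunner invokes GenericOptionsParser, which consumes the -files option and registers the listed files with the distributed cache before run is called:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class IndexJoinDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "index-join");
        job.setJarByClass(IndexJoinDriver.class);
        // Set your mapper/reducer and output key/value classes here.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new IndexJoinDriver(), args));
    }
}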
You could push the index file to the distributed cache, and it will be copied to the nodes before the mapper is executed.
See this SO thread.
Here's what helped me to solve the problem.
Since I am using Amazon's EMR with S3, I needed to change the syntax a bit, as stated on the following site.
It was necessary to add the name the system was going to use to read the file from the cache, as follows:
job.addCacheFile(new URI("s3://myBucket/input/index.txt" + "#index.txt"));
This way, the program understands that the file introduced into the cache is named just index.txt. I also needed to change the syntax for reading the file from the cache: instead of the entire path stored in the distributed cache, only the filename has to be used, as follows:
URI[] cacheFile = output.getCacheFiles();
BufferedReader br = new BufferedReader(new FileReader("index.txt"));
while ((line = br.readLine()) != null) {
//Do stuff
}

WSO2 How to transform file

Now I have a local file like this:
<userCode>001</userCode><productCode>001</productCode><Fee>1.00</Fee>
<userCode>002</userCode><productCode>002</productCode><Fee>2.00</Fee>
<userCode>003</userCode><productCode>003</productCode><Fee>3.00</Fee>
I need transform this file to :
<Fee>1.00</Fee><productCode>001</productCode>
<Fee>2.00</Fee><productCode>002</productCode>
<Fee>3.00</Fee><productCode>003</productCode>
I think I need to read first and then write. How do I do this in WSO2?
I hope you have a top-level element which wraps this data, making it proper XML.
ex :
<data><userCode>001</userCode><productCode>001</productCode><Fee>1.00</Fee>... </data>
Steps
1) Configure VFS transport sender and receiver in axis2.xml
2) Engage the ApplicationXML message builder and formatter for your content type (this can be anything, e.g. file/xml)
3) Configure a VFS proxy to listen to this content type in a given directory.
4) When a message comes in, use the XSLT mediator to do the transformation (a sample stylesheet is sketched after these steps)
5) Use VFS sender to store the transformed file.
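For step 4, a minimal XSLT sketch (assuming the wrapped <data> structure shown above, and keeping a wrapper element so the result stays well-formed) that drops userCode and emits Fee before productCode:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="/data">
    <data>
      <!-- For every productCode, copy the Fee that follows it, then the code itself;
           userCode elements are simply not copied. -->
      <xsl:for-each select="productCode">
        <xsl:copy-of select="following-sibling::Fee[1]"/>
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </data>
  </xsl:template>
</xsl:stylesheet>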
thanks,
Charith
