How to write to different files based on content for batch processing in Flink? - apache-flink

I am trying to process some files on HDFS and write the results back to HDFS too. The files are already prepared before the job starts. The thing is that I want to write to different paths and files based on the file content. I am aware that BucketingSink (doc here) is provided to achieve this in Flink streaming. However, it seems that the DataSet API does not have a similar sink. I have found some Q&As on Stack Overflow (1, 2, 3). Now I think I have two options:
Use Hadoop API: MultipleTextOutputFormat or MultipleOutputs;
Read the files as a stream and use BucketingSink.
My question is how to make a choice between them, or is there another solution? Any help is appreciated.
EDIT: This question may be a duplicate of this.

We faced the same problem. We too are surprised that DataSet does not support addSink().
I recommend not switching to streaming mode. You might give up some optimizations (e.g. memory pools) that are available in batch mode.
You will have to implement your own OutputFormat to do the bucketing: extend OutputFormat[YOUR_RECORD] (or RichOutputFormat[YOUR_RECORD]), where you can still use a BucketAssigner[YOUR_RECORD, String] to decide which output streams to open/write/close.
That's what we did and it's working great.
I hope Flink will support this in batch mode soon.
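For reference, here is a rough sketch of such an OutputFormat. The class name BucketingTextOutputFormat, the bucketFor() method and the paths are ours, not Flink API, and a simple content-based bucket function stands in for a full BucketAssigner:

// Sketch only: a content-bucketing OutputFormat for the DataSet API.
import org.apache.flink.api.common.io.RichOutputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class BucketingTextOutputFormat extends RichOutputFormat<String> {

    private final String basePath;
    private transient Map<String, FSDataOutputStream> openStreams;
    private transient int taskNumber;

    public BucketingTextOutputFormat(String basePath) {
        this.basePath = basePath;
    }

    @Override
    public void configure(Configuration parameters) {
        // nothing to configure in this sketch
    }

    @Override
    public void open(int taskNumber, int numTasks) {
        this.taskNumber = taskNumber;
        this.openStreams = new HashMap<>();
    }

    // Stand-in for a BucketAssigner: derive the bucket from the record content.
    private String bucketFor(String record) {
        return record.startsWith("ERROR") ? "errors" : "regular";
    }

    @Override
    public void writeRecord(String record) throws IOException {
        String bucket = bucketFor(record);
        FSDataOutputStream out = openStreams.get(bucket);
        if (out == null) {
            // one part file per bucket and parallel task, e.g. <base>/errors/part-3
            Path path = new Path(basePath, bucket + "/part-" + taskNumber);
            out = path.getFileSystem().create(path, FileSystem.WriteMode.OVERWRITE);
            openStreams.put(bucket, out);
        }
        out.write((record + "\n").getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
        for (FSDataOutputStream out : openStreams.values()) {
            out.close();
        }
    }
}

It is plugged in like any other OutputFormat, e.g. dataSet.output(new BucketingTextOutputFormat("hdfs:///output")).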

Related

How to make an automatic savepoint in a Flink Stateful Functions application?

I am trying to dive into the new Stateful Functions approach and I already tried to create a savepoint manually (https://ci.apache.org/projects/flink/flink-statefun-docs-release-2.1/deployment-and-operations/state-bootstrap.html#creating-a-savepoint).
It works like a charm, but I can't find a way to do it automatically. For example, I have a couple of million keys and I need to write them all to a savepoint.
Is your question about how to replace the env.fromElements in the example with something that reads from a file or another data source? Flink's DataSet API, which is what's used here, can read from any HadoopInputFormat. See DataSet Connectors for details.
There are easy-to-use shortcuts for common cases. If you just want to read data from a file using a TextInputFormat, that would look like this:
env.readTextFile(path)
and to read from a CSV file using a CsvInputFormat:
env.readCsvFile(path)
See Data Sources for more about working with these shortcuts.
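In the Java DataSet API that could look roughly like this (the paths are made up, and note that readCsvFile returns a builder that still needs the column types):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class ReadKeysExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // each line of the file becomes one element of the DataSet
        DataSet<String> keys = env.readTextFile("hdfs:///bootstrap/keys.txt");

        // in the Java API, readCsvFile returns a builder that still needs the column types
        DataSet<Tuple2<String, Long>> rows = env
                .readCsvFile("hdfs:///bootstrap/state.csv")
                .types(String.class, Long.class);

        System.out.println(keys.count() + " keys, " + rows.count() + " csv rows");
    }
}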
If I've misinterpreted the question, please clarify your concerns.

Consume multiple text files with Apache Flink DataSet API

I am writing a batch job with Apache Flink using the DataSet API. I can read a text file using readTextFile(), but this function just reads one file at a time.
I would like to consume all the text files in my directory one by one and process them, in the same function, as a batch job with the DataSet API, if that is possible.
Another option is to implement a loop that runs multiple jobs, one for each file, instead of one job with multiple files. But I don't think that solution is the best.
Any suggestion?
If I read the documentation right, you can read an entire directory using ExecutionEnvironment.readTextFile(). You can find an example here: Word-Count-Batch-Example
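A minimal sketch (paths are made up) that points readTextFile() at a directory so that every file in it is read, each line becoming one element of the DataSet:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.FileSystem;

public class ReadDirectoryExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // pointing readTextFile at a directory makes Flink read every file inside it
        DataSet<String> lines = env.readTextFile("hdfs:///data/input-dir");

        lines.writeAsText("hdfs:///data/output", FileSystem.WriteMode.OVERWRITE);

        env.execute("read whole directory");
    }
}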
References:
Flink Documentation
Flink Sources

JPG PDF Files in Storm

I wanted to know if it's possible to manipulate JPG files in Storm. Should we expect any issues if JPG or PDF files are being transmitted from bolt to bolt? We are manipulating these files in very large volumes and need a distributed platform to keep up.
From my understanding, messages (and hopefully files) go into in-memory queues between bolts.
Has anyone tried to pass JPG or PDF files between bolts in Storm? Are there any limitations that would prevent this from working? If not Storm, can anyone recommend an appropriate platform?
Thank you for your help!
I have never tried this, but I did some experiments with large tuples which worked fine, and I would not expect any problems. As long as you can provide appropriate (de)serialization (best via Kryo), Storm does not care what the data is. To Storm, everything looks like a bunch of bytes anyway (except for key attributes that are used for fieldsGrouping).
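For what it's worth, a rough sketch of a bolt that just passes raw JPG/PDF bytes along (class and field names here are made up; byte[] is handled by Storm's built-in serialization, so no custom Kryo serializer is needed for plain bytes):

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Receives the raw bytes of a JPG/PDF, transforms them, and emits them downstream.
public class ProcessFileBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        byte[] fileBytes = input.getBinary(0);

        // placeholder for the actual image/PDF manipulation
        byte[] processed = fileBytes;

        collector.emit(new Values(processed));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("fileBytes"));
    }
}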
You might also check out Apache Flink (disclaimer: I am a contributor).

How to process multiple text files at a time to analysis using mapreduce in hadoop

I have lots of small files, say more than 50,000. I need to process these files at once using the MapReduce concept to generate some analysis based on the input files.
Please suggest a way to do this, and also please let me know how to merge these small files into a bigger file on HDFS.
See this blog post from Cloudera explaining the problem with small files.
There is a project on GitHub named FileCrush which merges a large number of small files. From the project's homepage:
Turn many small files into fewer larger ones. Also change from text to sequence and other compression options in one pass.
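For a quick start without FileCrush, a rough MapReduce sketch that packs many small text files into a single SequenceFile is below. The paths are made up, the input is assumed to be plain text, and a single reducer is used so all records land in one output file; FileCrush and CombineFileInputFormat handle this far more efficiently at scale:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

import java.io.IOException;

public class SmallFilesToSequenceFile {

    // keys each input line by the name of the file it came from
    public static class PackMapper extends Mapper<Object, Text, Text, Text> {
        private final Text fileName = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            FileSplit split = (FileSplit) context.getInputSplit();
            fileName.set(split.getPath().getName());
            context.write(fileName, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pack small files");
        job.setJarByClass(SmallFilesToSequenceFile.class);
        job.setMapperClass(PackMapper.class);
        job.setNumReduceTasks(1);                  // one reducer -> one big SequenceFile
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/small-files"));
        SequenceFileOutputFormat.setOutputPath(job, new Path("/data/packed"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}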

How to move large number of files files on disk to HDFS sequence files

I want to move large number of small files to HDFS sequence file(s). I have come across two options:
Use Flume. Flume does not have a built-in file source and this requires a custom source to push the files.
Use apache camel file to hdfs route.
Even though the above two methods serve the purpose, I would like to weigh other available options before picking one. In particular, I am interested in a solution that is more configurable and results in less code to maintain.
"Use Flume. Flume does not have a built-in file source and this requires a custom source to push the files."
Umm... no, that's not right. Flume has a Spooling Directory Source which does, at a high level, what you want.
Seems like a few lines of code with Camel. i.e. from("file:/..").to("hdfs:..") plus some init and project setup.
Not sure how much easier (fewer lines of code) you can make it with any other method.
If the HDFS options in Camel are enough in terms of configuration and flexibility, then I guess this approach is the best. It should take you just a matter of hours (or even minutes) to have some test cases up and running.
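A rough Camel sketch of that route, under assumptions: the paths and host are made up, and the hdfs2 component scheme (and its fileType option) depends on your Camel version; camel-core plus the HDFS component need to be on the classpath:

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class FilesToHdfsRoute {
    public static void main(String[] args) throws Exception {
        DefaultCamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // poll a local directory and write each file into HDFS as a sequence file;
                // delete=true removes the local file once it has been handed off
                from("file:/data/incoming?delete=true")
                    .to("hdfs2://namenode:8020/data/archive?fileType=SEQUENCE_FILE");
            }
        });
        context.start();
        Thread.sleep(60_000);   // keep the route running for a while in this sketch
        context.stop();
    }
}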
