Read and parse all files inside S3 paths from Kafka with Flink - apache-flink

I am reading the S3 url from a Kafka producer. Then going to that S3 url to process all files inside that folder. After grabbing the info from each file, I will pass that data to a sink.
Initially, I have a DataStream<String> that will read and grab the nested JSON value from the Kafka source's ObjectNode, using the JSONKeyValueDeserializationSchema. So the path exists as a String inside the DataStream. How do I pass this string to a FileSource? The FileSource object takes in a Path object for the place of the folder.
I'm planning to use FileSource.forRecordStreamFormat to go through all the files and then all the lines of each file. However, this outputs a FileSource<String>, then a DataStream<String> by calling env.fromSourced.
The example I'm looking at now is: https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/sideoutput/SideOutputExample.java
I see that FileSource takes in a Path object and then eventually gets a DataStream<String> but is there a way for me to grab that String from the initial Kafka source DataStream<String> and then use it for a FileSource?

Related

Use of compaction for Parquet bulk format

Since version 1.15 of Apache Flink you can use the compaction feature to merge several files into one.
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#compaction
How can we use compaction with bulk Parquet format?
The existing implementations for the RecordWiseFileCompactor.Reader (DecoderBasedReader and ImputFormatBasedReader) do not seem suitable for Parquet.
Furthermore we can not find any example for compacting Parquet or other bulk formats.
There are two types of file compactor mentioned in flink's document.
OutputStreamBasedFileCompactor : The users can write the compacted results into an output stream. This is useful when the users don’t want to or can’t read records from the input files.
RecordWiseFileCompactor : The compactor can read records one-by-one from the input files and write into the result file similar to the FileWriter.
If I remember correctly, Parquet saves meta information at end of files. So obviously we need to use RecordWiseFileCompactor. Because we need to read the whole Parquet file so we can get the meta information at the end of the file. Then we can use the meta information (number of row groups, schema) to parse the file.
From the java api, to construct a RecordWiseFileCompactor, we need a instance of RecordWiseFileCompactor.Reader.Factory.
There are two implementations of interface RecordWiseFileCompactor.Reader.Factory, DecoderBasedReader.Factory and InputFormatBasedReader.Factory respectively.
DecoderBasedReader.Factory creates a DecoderBasedReader instance, which reads whole file content from InputStream. We can load the bytes into a buffer and parse the file from the byte buffer, which is obviously painful. So we don't use this implementation.
InputFormatBasedReader.Factory creates a InputFormatBasedReader, which reads whole file content using the FileInputFormat supplier we passed to InputFormatBasedReader.Factory constructor.
The InputFormatBasedReader instance uses the FileInputFormat to read record by record, and pass records to the writer which we passed to forBulkFormat call, till the end of the file.
The writer receives all the records and compact the records into one file.
So the question becomes what is FileInputFormat and how to implement it.
Though there are many methods and fields of class FileInputFormat, we know only four methods are called from InputFormatBasedReader from InputFormatBasedReader source code mentioned above.
open(FileInputSplit fileSplit), which opens the file
reachedEnd(), which checks if we hit end of file
nextRecord(), which reads next record from the opened file
close(), which cleans up the site
Luckily, there's a AvroParquetReader from package org.apache.parquet.avro we can utilize. It has already implemented open/read/close. So we can wrap the reader inside a FileInputFormat and use the AvroParquetReader to do all the dirty works.
Here's a example code snippet
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;
import java.io.IOException;
public class ExampleFileInputFormat extends FileInputFormat<GenericRecord> {
private ParquetReader<GenericRecord> parquetReader;
private GenericRecord readRecord;
#Override
public void open(FileInputSplit split) throws IOException {
Configuration config = new Configuration();
// set hadoop config here
// for example, if you are using gcs, set fs.gs.impl here
// i haven't tried to use core-site.xml but i believe this is feasible
InputFile inputFile = HadoopInputFile.fromPath(new org.apache.hadoop.fs.Path(split.getPath().toUri()), config);
parquetReader = AvroParquetReader.<GenericRecord>builder(inputFile).build();
readRecord = parquetReader.read();
}
#Override
public void close() throws IOException {
parquetReader.close();
}
#Override
public boolean reachedEnd() throws IOException {
return readRecord == null;
}
#Override
public GenericRecord nextRecord(GenericRecord genericRecord) throws IOException {
GenericRecord r = readRecord;
readRecord = parquetReader.read();
return r;
}
}
Then you can use the ExampleFileInputFormat like below
FileSink<GenericRecord> sink = FileSink.forBulkFormat(
new Path(path),
AvroParquetWriters.forGenericRecord(schema))
.withRollingPolicy(OnCheckpointRollingPolicy.build())
.enableCompact(
FileCompactStrategy.Builder.newBuilder()
.enableCompactionOnCheckpoint(10)
.build(),
new RecordWiseFileCompactor<>(
new InputFormatBasedReader.Factory<>(new SerializableSupplierWithException<FileInputFormat<GenericRecord>, IOException>() {
#Override
public FileInputFormat<GenericRecord> get() throws IOException {
FileInputFormat<GenericRecord> format = new ExampleFileInputFormat();
return format;
}
})
))
.build();
I have successfully deployed this to a flink on k8s and compacted files on gcs. There're some notes for deploying.
You need to download flink shaded hadoop jar from https://flink.apache.org/downloads.html (search Pre-bundled Hadoop in webpage) and the jar into $FLINK_HOME/lib/
If you are writing files to some object storage, for example gcs, you need to follow the plugin instruction. Remember to put the plugin jar into the plugin folder but not the lib foler.
If you are writing files to some object storage, you need to download the connector jar from cloud service supplier. For example, I'm using gcs and download gcs-connector jar following GCP instruction. Put the jar into some foler other than $FLINK_HOME/lib or $FLINK_HOME/plugins. I put the connector jar into a newly made folder $FLINK_HOME/hadoop-lib
Set environment HADOOP_CLASSPATH=$FLINK_HOME/lib/YOUR_SHADED_HADOOP_JAR:$FLINK_HOME/hadoop-lib/YOUR_CONNECTOR_JAR
After all these steps, you can start your job and good to go.

Akka stream each element to ftp sink

I want to write each element in an Akka stream to a (different) FTP file. Using Alpakka I can write each element to the same file using an FTP sink. However I can not seem to figure out how to write each element to a different file.
source.map(el -> /* to byte string */).to(Ftp.toPath("/file.xml", settings));
So every el should end up in a different file.
If you want to use the Alpakka FTP sink, you have to do something along the lines of
def sink(n: String): Sink[String, NotUsed] = Ftp.toPath(s"$n.txt", settings)
source.runForeach(s ⇒ Source.single(s).runWith(sink(s)))
otherwise, you'll need to create your own sink that establishes an FTP connection and writes the data as part of the input handler. You'll need to create your own graph stage to do it. More info about this can be found in the docs.

Hadoop Map Whole File in Java

I am trying to use Hadoop in java with multiple input files. At the moment I have two files, a big one to process and a smaller one that serves as a sort of index.
My problem is that I need to maintain the whole index file unsplitted while the big file is distributed to each mapper. Is there any way provided by the Hadoop API to make such thing?
In case if have not expressed myself correctly, here is a link to a picture that represents what I am trying to achieve: picture
Update:
Following the instructions provided by Santiago, I am now able to insert a file (or the URI, at least) from Amazon's S3 into the distributed cache like this:
job.addCacheFile(new Path("s3://myBucket/input/index.txt").toUri());
However, when the mapper tries to read it a 'file not found' exception occurs, which seems odd to me. I have checked the S3 location and everything seems to be fine. I have used other S3 locations to introduce the input and output file.
Error (note the single slash after the s3:)
FileNotFoundException: s3:/myBucket/input/index.txt (No such file or directory)
The following is the code I use to read the file from the distributed cache:
URI[] cacheFile = output.getCacheFiles();
BufferedReader br = new BufferedReader(new FileReader(cacheFile[0].toString()));
while ((line = br.readLine()) != null) {
//Do stuff
}
I am using Amazon's EMR, S3 and the version 2.4.0 of Hadoop.
As mentioned above, add your index file to the Distributed Cache and then access the same in your mapper. Behind the scenes. Hadoop framework will ensure that the index file will be sent to all the task trackers before any task is executed and will be available for your processing. In this case, data is transferred only once and will be available for all the tasks related your job.
However, instead of add the index file to the Distributed Cache in your mapper code, make your driver code to implement ToolRunner interface and override the run method. This provides the flexibility of passing the index file to Distributed Cache through the command prompt while submitting the job
If you are using ToolRunner, you can add files to the Distributed Cache directly from the command line when you run the job. No need to copy the file to HDFS first. Use the -files option to add files
hadoop jar yourjarname.jar YourDriverClassName -files cachefile1, cachefile2, cachefile3, ...
You can access the files in your Mapper or Reducer code as below:
File f1 = new File("cachefile1");
File f2 = new File("cachefile2");
File f3 = new File("cachefile3");
You could push the index file to the distributed cache, and it will be copied to the nodes before the mapper is executed.
See this SO thread.
Here's what helped me to solve the problem.
Since I am using Amazon's EMR with S3, I have needed to change the syntax a bit, as stated on the following site.
It was necessary to add the name the system was going to use to read the file from the cache, as follows:
job.addCacheFile(new URI("s3://myBucket/input/index.txt" + "#index.txt"));
This way, the program understands that the file introduced into the cache is named just index.txt. I also have needed to change the syntax to read the file from the cache. Instead of reading the entire path stored on the distributed cache, only the filename has to be used, as follows:
URI[] cacheFile = output.getCacheFiles();
BufferedReader br = new BufferedReader(new FileReader(#the filename#));
while ((line = br.readLine()) != null) {
//Do stuff
}

How can Apache Camel be used to monitor file changes?

I would like to monitor all of the files in a given directory for changes, ie an updated timestamp. This use case seems natural for Camel using the file component, but I can't seem to find a way to configure this behavior.
A uri like:
file:/some/directory
will consume the files in the provided directory but will delete them.
A uri like:
file:/some/directory?noop=true
consumes each file once when it is added or when the route is started.
It's surprising that there isn't an option along the lines of
consumeOnChange=true
Is there a straightforward way to monitor file changes and not delete the file after consuming?
You can do this by setting up the idempotentKey to tell Camel how a file is considered changed. For example if the file size changes, or its timestamp changes etc.
See more details at the Camel file documentation at: https://camel.apache.org/components/latest/file-component.html
See the section Avoiding reading the same file more than once (idempotent consumer). And read about idempotent and idempotentKey.
So something alike
from("file:/somedir?noop=true&idempotentKey=${file:name}-${file:size}")
Or
from("file:/somedir?noop=true&idempotentKey=${file:name}-${file:modified}")
You can read here about the various ${file:xxx} tokens you can use: http://camel.apache.org/file-language.html
Setting noop to true will result in Camel setting idempotent=true as well, despite the fact that idempotent is false by default.
Simplest solution to monitor files would be:
.from("file:path?noop=true&idempotent=false&delay=60s")
This will monitor changes to all files in the given directory every one minute.
This can be found in the Camel documentation at: http://camel.apache.org/file2.html.
I don't think Camel supports that specific feature but with the existent options you can come up with a similar solution of monitoring a directory.
What you need to do is set a small delay value to check the directory and maintain a repository of the already read files. Depending on how you configure the repository (by size, by filename, by a mix of them...) this solution would be able to provide you information about news files and modified files. As a caveat it would be consuming the files in the directory very often.
Maybe you could use other solutions different from Camel like Apache Commons VFS2 (I wrote a explanation about how to use it for this scenario: WatchService locks some files?
I faced the same problem i.e. wanted to copy updated files also (along with new files). Below is my configuration,
public static void main(String[] a) throws Exception {
CamelContext cc = new DefaultCamelContext();
cc.addRoutes(createRouteBuilder());
cc.start();
Thread.sleep(10 * 60 * 1000);
cc.stop();
}
protected static RouteBuilder createRouteBuilder() {
return new RouteBuilder() {
public void configure() {
from("file://D:/Production"
+ "?idempotent=true"
+ "&idempotentKey=${file:name}-${file:size}"
+ "&include=.*.log"
+ "&noop=true"
+ "&readLock=changed")
.to("file://D:/LogRepository");
}
};
}
My testing steps:
Run the program and it copies few .log files from D:/Production to D:/LogRepository and then continues to poll D:/Production directory
I opened a already copied log say A.log from D:/Production (since noop=true nothing is moved) and edited it with some editor tool. This doubled the file size and save it.
At this point I think Camel is supposed to copy that particular file again since its size is modified and in my route definition I used "idempotent=true&idempotentKey=${file:name}-${file:size}&readLock=changed". But camel ignores the file.
When I use TRACE for logging it says "Skipping as file is already in progress...", but I did not find any lock file in D:/Production directory when I editted and saved the file.
I also checked that camel still ignores the file if I replace A.log (with same name but bigger size) in D:/Production directory from outside.
But I found, everything is working as expected if I remove noop=true option.
Am I missing something?
If you want monitor file changes in camel, use file-watch component.
Example -> RECURSIVE WATCH ALL EVENTS (FILE CREATION, FILE DELETION, FILE MODIFICATION):
from("file-watch://some-directory")
.log("File event: ${header.CamelFileEventType} occurred on file ${header.CamelFileName} at ${header.CamelFileLastModified}");
You can see the complete documentation here:
Camel file-watch component

Camel: "file" component, but only pass file name as body

I'm trying to process potentially large files using Camel, and am worried about them "fitting" in the body of a Camel Message. Is there a way I can just pass the name (path) of the file as the body of the message, and then in a processor use that to read from disk?
You can just pass in a java.io.File instance. This is essentially what the Camel file component does itself (although its placed inside a WrappedFile, due sharing code with the ftp components).
You can of course also just store the name of the file as a String, and then from the processor access the file, either by
String name = exchange.getIn().getBody(String.class);
File file = new File(name);
...
FileInputStream fis = new FileInputStream(file);
// read the file from the stream, etc.

Resources