Use of compaction for Parquet bulk format - apache-flink

Since version 1.15 of Apache Flink, you can use the compaction feature to merge several files into one.
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#compaction
How can we use compaction with the bulk Parquet format?
The existing implementations of RecordWiseFileCompactor.Reader (DecoderBasedReader and InputFormatBasedReader) do not seem suitable for Parquet.
Furthermore, we cannot find any example of compacting Parquet or other bulk formats.

There are two types of file compactor mentioned in Flink's documentation.
OutputStreamBasedFileCompactor: The users can write the compacted results into an output stream. This is useful when the users don't want to or can't read records from the input files.
RecordWiseFileCompactor: The compactor can read records one-by-one from the input files and write into the result file similar to the FileWriter.
If I remember correctly, Parquet saves its meta information at the end of the file. So obviously we need to use RecordWiseFileCompactor, because we need to read the whole Parquet file to get the meta information at its end. Then we can use that meta information (number of row groups, schema) to parse the file.
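As an aside, the footer metadata mentioned above can be inspected directly with ParquetFileReader from parquet-hadoop. The following is only a minimal sketch to illustrate what the footer contains; the file path is a placeholder:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class InspectParquetFooter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // placeholder path; point this at any existing Parquet file
        HadoopInputFile inputFile =
                HadoopInputFile.fromPath(new Path("/tmp/example.parquet"), conf);
        try (ParquetFileReader reader = ParquetFileReader.open(inputFile)) {
            // the footer holds the row group metadata and the file schema
            ParquetMetadata footer = reader.getFooter();
            System.out.println("row groups: " + footer.getBlocks().size());
            MessageType schema = footer.getFileMetaData().getSchema();
            System.out.println("schema: " + schema);
        }
    }
}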
From the Java API, to construct a RecordWiseFileCompactor, we need an instance of RecordWiseFileCompactor.Reader.Factory.
There are two implementations of the interface RecordWiseFileCompactor.Reader.Factory: DecoderBasedReader.Factory and InputFormatBasedReader.Factory.
DecoderBasedReader.Factory creates a DecoderBasedReader instance, which reads the whole file content from an InputStream. We could load the bytes into a buffer and parse the file from the byte buffer, which is obviously painful, so we don't use this implementation.
InputFormatBasedReader.Factory creates an InputFormatBasedReader, which reads the whole file content using the FileInputFormat supplier we pass to the InputFormatBasedReader.Factory constructor.
The InputFormatBasedReader instance uses the FileInputFormat to read the file record by record and passes the records to the writer we supplied in the forBulkFormat call, until the end of the file is reached.
The writer receives all the records and compacts them into one file.
So the question becomes: what is FileInputFormat and how do we implement it?
Though the class FileInputFormat has many methods and fields, we can see from the InputFormatBasedReader source code mentioned above that only four methods are called from InputFormatBasedReader:
open(FileInputSplit fileSplit), which opens the file
reachedEnd(), which checks if we hit end of file
nextRecord(), which reads next record from the opened file
close(), which cleans up the resources
Luckily, there is an AvroParquetReader in the package org.apache.parquet.avro that we can utilize. It already implements open/read/close, so we can wrap the reader inside a FileInputFormat and let the AvroParquetReader do all the dirty work.
Here's an example code snippet:
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

import java.io.IOException;

public class ExampleFileInputFormat extends FileInputFormat<GenericRecord> {

    private ParquetReader<GenericRecord> parquetReader;
    private GenericRecord readRecord;

    @Override
    public void open(FileInputSplit split) throws IOException {
        Configuration config = new Configuration();
        // set Hadoop config here
        // for example, if you are using GCS, set fs.gs.impl here
        // I haven't tried to use core-site.xml but I believe this is feasible
        InputFile inputFile = HadoopInputFile.fromPath(new org.apache.hadoop.fs.Path(split.getPath().toUri()), config);
        parquetReader = AvroParquetReader.<GenericRecord>builder(inputFile).build();
        readRecord = parquetReader.read();
    }

    @Override
    public void close() throws IOException {
        parquetReader.close();
    }

    @Override
    public boolean reachedEnd() throws IOException {
        return readRecord == null;
    }

    @Override
    public GenericRecord nextRecord(GenericRecord genericRecord) throws IOException {
        GenericRecord r = readRecord;
        readRecord = parquetReader.read();
        return r;
    }
}
Then you can use the ExampleFileInputFormat as below:
FileSink<GenericRecord> sink = FileSink.forBulkFormat(
        new Path(path),
        AvroParquetWriters.forGenericRecord(schema))
    .withRollingPolicy(OnCheckpointRollingPolicy.build())
    .enableCompact(
        FileCompactStrategy.Builder.newBuilder()
            .enableCompactionOnCheckpoint(10)
            .build(),
        new RecordWiseFileCompactor<>(
            new InputFormatBasedReader.Factory<>(new SerializableSupplierWithException<FileInputFormat<GenericRecord>, IOException>() {
                @Override
                public FileInputFormat<GenericRecord> get() throws IOException {
                    return new ExampleFileInputFormat();
                }
            })
        ))
    .build();
I have successfully deployed this to Flink on Kubernetes and compacted files on GCS. Here are some notes for deployment:
You need to download the Flink shaded Hadoop jar from https://flink.apache.org/downloads.html (search for "Pre-bundled Hadoop" on the page) and put the jar into $FLINK_HOME/lib/.
If you are writing files to object storage, for example GCS, you need to follow the plugin instructions. Remember to put the plugin jar into the plugins folder, not the lib folder.
If you are writing files to object storage, you also need to download the connector jar from the cloud service provider. For example, I'm using GCS and downloaded the gcs-connector jar following the GCP instructions. Put the jar into some folder other than $FLINK_HOME/lib or $FLINK_HOME/plugins. I put the connector jar into a newly created folder, $FLINK_HOME/hadoop-lib.
Set the environment variable HADOOP_CLASSPATH=$FLINK_HOME/lib/YOUR_SHADED_HADOOP_JAR:$FLINK_HOME/hadoop-lib/YOUR_CONNECTOR_JAR
After all these steps, you can start your job and you're good to go.

Related

Does Dart have any log manager which can put logs in files (email etc.)?

I mean, is there a logger similar to log4j?
I found these loggers:
https://pub.dev/packages/logger
https://pub.dev/packages/simple_logger
https://pub.dev/packages/quick_log
but they can't write to a file.
https://pub.dev/packages/f_logs
can't send email.
https://pub.dev/packages/log_4_dart_2
doesn't work for me.
Logger Readme
If you read the documentation, they have already provided a solution: you can extend the default logger to do anything with its output.
Simply extend the output class and send the output by email, write it through file streams, or even log it to a database; the choice is yours.
class ConsoleOutput extends LogOutput {
  @override
  void output(OutputEvent event) {
    for (var line in event.lines) {
      print(line);
    }
  }
}
You can write to a file and then send its contents via an email service to do your reporting task.

Can I register a file in the distributed cache during FlatMapFunction in Flink?

I have a FlatMapFunction that lists items in S3. I want to register each item in the distributed file cache.
Is that even possible?
i.e., in my job:
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
...
... = myDataSet.flatMap(new S3Lister(env));
and in the S3Lister file:
...
String id = os.getKey().substring(os.getKey().lastIndexOf('/') + 1);
env.registerCachedFile("s3://" + bucket + os.getKey(), id);
...
and then later access it from the distributed cache in another custom coGroup function.
Could this work? Are you even allowed to pass the ExecutionEnvironment around like that?
Update:
If not, what's the best way to get an entire S3 bucket into a distributed file cache for use in a flink job?
Essentially, the registerCachedFile method uploads the files when the job is submitted, so it's not possible to call it from within an already deployed program.
But from your description, why not read the S3 files directly?
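For illustration, reading the bucket contents directly could be as simple as pointing a file source at the S3 path. This is only a minimal sketch; the bucket and prefix are placeholders, and it assumes the S3 filesystem is configured for the cluster:
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadS3Directly {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // readTextFile accepts a directory path and reads the files under it,
        // so the listed objects can be consumed without the distributed cache
        DataSet<String> lines = env.readTextFile("s3://myBucket/somePrefix/");
        System.out.println("lines read: " + lines.count());
    }
}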
You can use Rich functions instead of normal ones and then load your distributed cache in them.
First you register your file:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// register a file from HDFS
env.registerCachedFile("hdfs:///path/to/your/file", "hdfsFile");
// register a local executable file (script, executable, ...)
env.registerCachedFile("file:///path/to/exec/file", "localExecFile", true);
// define your program and execute
...
DataSet<String> input = ...
DataSet<Integer> result = input.map(new MyMapper());
...
env.execute();
and then use it in your RichFunction class:
// extend a RichFunction to have access to the RuntimeContext
public final class MyMapper extends RichMapFunction<String, Integer> {

    @Override
    public void open(Configuration config) {
        // access cached file via RuntimeContext and DistributedCache
        File myFile = getRuntimeContext().getDistributedCache().getFile("hdfsFile");
        // read the file (or navigate the directory)
        ...
    }

    @Override
    public Integer map(String value) throws Exception {
        // use content of cached file
        ...
    }
}
You can see these examples in the Flink documentation.

Hadoop Map Whole File in Java

I am trying to use Hadoop in Java with multiple input files. At the moment I have two files: a big one to process and a smaller one that serves as a sort of index.
My problem is that I need to keep the whole index file unsplit while the big file is distributed to each mapper. Is there any way provided by the Hadoop API to do such a thing?
In case I have not expressed myself correctly, here is a link to a picture that represents what I am trying to achieve: picture
Update:
Following the instructions provided by Santiago, I am now able to insert a file (or the URI, at least) from Amazon's S3 into the distributed cache like this:
job.addCacheFile(new Path("s3://myBucket/input/index.txt").toUri());
However, when the mapper tries to read it, a 'file not found' exception occurs, which seems odd to me. I have checked the S3 location and everything seems to be fine. I have used other S3 locations for the input and output files.
Error (note the single slash after s3:):
FileNotFoundException: s3:/myBucket/input/index.txt (No such file or directory)
The following is the code I use to read the file from the distributed cache:
URI[] cacheFile = output.getCacheFiles();
BufferedReader br = new BufferedReader(new FileReader(cacheFile[0].toString()));
while ((line = br.readLine()) != null) {
//Do stuff
}
I am using Amazon EMR, S3, and Hadoop version 2.4.0.
As mentioned above, add your index file to the Distributed Cache and then access it in your mapper. Behind the scenes, the Hadoop framework will ensure that the index file is sent to all the task trackers before any task is executed and is available for your processing. In this case, data is transferred only once and is available for all the tasks related to your job.
However, instead of adding the index file to the Distributed Cache in your mapper code, make your driver class implement the Tool interface (run through ToolRunner) and override the run method. This provides the flexibility of passing the index file to the Distributed Cache through the command line while submitting the job.
If you are using ToolRunner, you can add files to the Distributed Cache directly from the command line when you run the job. There is no need to copy the file to HDFS first. Use the -files option to add files:
hadoop jar yourjarname.jar YourDriverClassName -files cachefile1,cachefile2,cachefile3,...
You can access the files in your Mapper or Reducer code as below:
File f1 = new File("cachefile1");
File f2 = new File("cachefile2");
File f3 = new File("cachefile3");
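For completeness, a minimal sketch of such a ToolRunner-based driver could look like the following; the class name matches the command above, the job name is hypothetical, and the mapper/reducer setup is elided:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class YourDriverClassName extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects the generic options (-files, -D, ...)
        // parsed by ToolRunner before run() is called
        Job job = Job.getInstance(getConf(), "index-join-example");
        job.setJarByClass(YourDriverClassName.class);
        // ... set mapper, reducer, input and output paths here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new YourDriverClassName(), args));
    }
}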
You could push the index file to the distributed cache, and it will be copied to the nodes before the mapper is executed.
See this SO thread.
Here's what helped me to solve the problem.
Since I am using Amazon's EMR with S3, I needed to change the syntax a bit, as stated on the following site.
It was necessary to add the name that the system would use to read the file from the cache, as follows:
job.addCacheFile(new URI("s3://myBucket/input/index.txt" + "#index.txt"));
This way, the program understands that the file introduced into the cache is named just index.txt. I also needed to change the syntax to read the file from the cache. Instead of reading the entire path stored in the distributed cache, only the filename has to be used, as follows:
URI[] cacheFile = output.getCacheFiles();
BufferedReader br = new BufferedReader(new FileReader("index.txt"));
while ((line = br.readLine()) != null) {
//Do stuff
}

Wildcards in GCS bucket Java client api

Using wildcards in the file name, I am trying to read files from a GCS bucket.
On the gsutil command line, wildcards work for specifying file names,
but in the Java client API
GcsFilename filename = new GcsFilename(BUCKETNAME, "big*");
it searches for a file named "big*" instead of files starting with "big".
Please help me understand how I can use wildcards with GcsFilename.
Thanks in advance.
Wildcard characters are a feature of gsutil, but they're not an inherent part of the Google Cloud Storage API. You can, however, handle this the same way that gsutil does.
If you want to find the name of every object that begins with a certain prefix, Google Cloud Storage's APIs provide a list method with a "prefix" argument. Only objects matching the prefix will be returned. This doesn't work for arbitrary regular expressions, but it will work for your example.
The documentation for the list method goes into more detail.
As Brandon Yarbrough mentioned, GcsFilename represents the name of a single GCS object, which can include any valid UTF-8 character (excluding a few such as \r and \n, but including '*', though that is not recommended). See https://developers.google.com/storage/docs/bucketnaming#objectnames for more info.
The GAE GCS client does not support listing yet (though that is planned to be added), so for now you can use the GCS XML or JSON API directly (using urlfetch) or use the Java GCS API client, https://developers.google.com/api-client-library/java/apis/storage/v1
See an example of the latter option:
public class ListServlet extends HttpServlet {

    public static final List<String> OAUTH_SCOPES =
            ImmutableList.of("https://www.googleapis.com/auth/devstorage.read_write");

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        try {
            String bucket = req.getParameter("bucket");
            AppIdentityCredential cred = new AppIdentityCredential(OAUTH_SCOPES);
            Storage storage = new Storage.Builder(new UrlFetchTransport(), new JacksonFactory(), cred)
                    .setApplicationName(SystemProperty.applicationId.get()).build();
            Objects.List list = storage.objects().list(bucket);
            for (StorageObject o : list.execute().getItems()) {
                resp.getWriter().println(o.getName() + " -> " + o);
            }
        } catch (Exception ex) {
            throw new ServletException(ex);
        }
    }
}
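To get the prefix-based listing described in the first answer, the same list call can be narrowed with setPrefix. A minimal sketch, assuming it is placed inside the doPost method above so that storage, bucket and resp are in scope; listing objects whose names start with "big" is effectively what the gsutil wildcard "big*" does:
// only objects whose names start with "big" are returned
Objects.List list = storage.objects().list(bucket).setPrefix("big");
com.google.api.services.storage.model.Objects result = list.execute();
if (result.getItems() != null) {
    for (StorageObject o : result.getItems()) {
        resp.getWriter().println(o.getName());
    }
}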

Using apache camel csv processor with pollEnrich pattern?

Apache Camel 2.12.1
Is it possible to use the Camel CSV component with a pollEnrich? Every example I see is like:
from("file:somefile.csv").marshal...
Whereas I'm using the pollEnrich, like:
pollEnrich("file:somefile.csv", new CSVAggregator())
So within CSVAggregator I have no CSV, just a file, on which I have to do the CSV processing myself. So is there a way of hooking up the marshalling to the enrich bit somehow?
EDIT
To make this more general... eg:
from("direct:start")
.to("http:www.blah")
.enrich("file:someFile.csv", new CSVAggregationStrategy) <--how can I call marshal() on this?
...
public class CSVAggregator implements AggregationStrategy {

    @Override
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        /* Here I have:
           oldExchange = results of http blah endpoint
           newExchange = the someFile.csv GenericFile object */
    }
}
Is there any way I can avoid this and use a marshal().csv sort of call on the route itself?
Thanks,
Mr Tea
You can use any endpoint in enrich. That includes direct endpoints pointing to other routes. Your example...
Replace this:
from("direct:start")
    .to("http:www.blah")
    .enrich("file:someFile.csv", new CSVAggregationStrategy())
With this:
from("direct:start")
    .to("http:www.blah")
    .enrich("direct:readSomeFile", new CSVAggregationStrategy());

from("direct:readSomeFile")
    .to("file:someFile.csv")
    .unmarshal(myDataFormat);
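Since the goal here is CSV specifically, myDataFormat in the last route can simply be the built-in CSV data format via unmarshal().csv(). A sketch of that one route, to be placed inside a RouteBuilder as above:
from("direct:readSomeFile")
    .to("file:someFile.csv")
    .unmarshal().csv();   // the body becomes a List<List<String>> of CSV rows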
I ran into the same issue and managed to solve it with the following code (note: I'm using the Scala DSL). My use case was slightly different: I wanted to load a CSV file and enrich it with data from an additional static CSV file.
from("direct:start") pollEnrich("file:c:/data/inbox?fileName=vipleaderboard.inclusions.csv&noop=true") unmarshal(csv)
from("file:c:/data/inbox?fileName=vipleaderboard.${date:now:yyyyMMdd}.csv") unmarshal(csv) enrich("direct:start", (current:Exchange, myStatic:Exchange) => {
// both exchange bodies will contain lists instead of file handles
})
Here the second route is the one which looks for a file in a specific directory. It unmarshals the CSV data from any matching file it finds and enriches it with the direct route defined in the preceding line. That route is pollEnriching with my static file and, as I don't define an aggregation strategy, it just replaces the contents of the body with the static file data. I can then unmarshal that from CSV and return the data.
The aggregation function in the second route then has access to both files' CSV data as List<List<String>> instead of just a file.
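For reference, an aggregation strategy that works on those unmarshalled rows rather than on file handles could look roughly like this in the Java DSL used elsewhere in this thread. This is only a sketch: the class name is made up, and it assumes both exchanges were unmarshalled with the CSV data format before aggregation:
import java.util.List;

import org.apache.camel.Exchange;
import org.apache.camel.processor.aggregate.AggregationStrategy;

public class CsvMergeStrategy implements AggregationStrategy {

    @Override
    @SuppressWarnings("unchecked")
    public Exchange aggregate(Exchange oldExchange, Exchange newExchange) {
        // both bodies are List<List<String>> because the routes unmarshalled
        // the CSV before the enrich step
        List<List<String>> current = oldExchange.getIn().getBody(List.class);
        List<List<String>> extra = newExchange.getIn().getBody(List.class);
        // merge however the business logic requires; here we simply append
        current.addAll(extra);
        oldExchange.getIn().setBody(current);
        return oldExchange;
    }
}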
