I have a generated Avro class (i.e. extends SpecificRecordBase) which happens to be named PeopleTransformation. I'm not using a schema registry and am hoping not to have to use one to solve this problem.
I am explicitly serializing constructed instances of PeopleTransformation in my Flink application using:
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.7.7</version>
</dependency>
The code is as follows (I recognise there should be a try/catch/finally here, but that's a separate issue):
public static byte[] serialize(PeopleTransformation record) throws IOException {
    DatumWriter<PeopleTransformation> writer = new SpecificDatumWriter<PeopleTransformation>(record.getSchema());
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    writer.write(record, encoder);
    encoder.flush();
    IOUtils.closeQuietly(out);
    return out.toByteArray();
}
I send the resulting byte array out through my FlinkKafkaProducer, via Kafka, to a Kafka consumer (separate JVM, non-Flink simple Java application) where I attempt to deserialize it using the same avro library.
public static <T extends SpecificRecordBase> T deserializeRecord(byte[] bytes, Class<T> avroClass)
        throws IOException {
    SpecificDatumReader<T> reader = new SpecificDatumReader<>(avroClass);
    return reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null));
}
This results in the following exception:
java.lang.ArrayIndexOutOfBoundsException: 40
at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:402)
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
......
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
After investigation I've noted that my Flink application contains the following dependency (which seems to be necessary even though my application makes no explicit use of its classes):
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-avro</artifactId>
    <version>1.14.2</version>
</dependency>
and that this package introduces an org.apache.flink.formats.avro.typeutils.AvroTypeInfo class (https://github.com/apache/flink/blob/release-1.14.2/flink-formats/flink-avro/src/main/java/org/apache/flink/formats/avro/typeutils/AvroTypeInfo.java#L82-L85)
which, if I'm understanding correctly, seems to be used to automatically convert each CharSequence in a schema to a Utf8. The javadoc for this class (https://nightlies.apache.org/flink/flink-docs-master/api/java/org/apache/flink/formats/avro/typeutils/AvroTypeInfo.html)
says:
"Special type information to generate a special AvroTypeInfo for Avro POJOs
(implementing SpecificRecordBase, the typed Avro POJOs). Proceeding: It
uses a regular pojo type analysis and replaces all
GenericType<CharSequence> with a GenericType<avro.Utf8>. All other types
used by Avro are standard Java types. Only strings are represented as
CharSequence fields and represented as Utf8 classes at runtime.
CharSequence is not comparable. To make them nicely usable with field
expressions, we replace them here by generic type infos containing Utf8
classes (which are comparable)".
Could the above substitution explain the incompatibility I'm seeing? If not, is there any other explanation? (I've tried removing the flink-avro dependency from my Flink application, but then I get a missing-class exception with respect to AvroTypeInfo.)
Any explanation or insight would be very welcome. Note that in production use the messages will be consumed by third-party applications into which it would be difficult to introduce any Flink specifics.
I have a Flink topology that consists of multiple Map and FlatMap transformations. The source/sink read from/write to Kafka. The Kafka records are of type Envelope (defined by someone else), and are not marked as Serializable. I want to unit test this topology.
I defined a simple SourceFunction that returns a list of Envelope as the source:
public class MySource extends RichParallelSourceFunction<Envelope> {

    private List<Envelope> input;

    public MySource(List<Envelope> input) {
        this.input = input;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
    }

    @Override
    public void run(SourceContext<Envelope> ctx) throws Exception {
        for (Envelope listElement : input) {
            ctx.collect(listElement);
        }
    }

    @Override
    public void cancel() {}
}
I am using MiniClusterWithClientResource to unit test the topology. I ran into two problems:
I needed to make MySource serializable, as Flink wants/needs to serialize the source. As a workaround, I made input transient. That allowed the code to compile.
Then I ran into the runtime error:
org.apache.flink.api.common.functions.InvalidTypesException: The return type of function 'Custom Source' could not be determined automatically, due to type erasure. You can give type information hints by using the returns(...) method on the result of the transformation call, or by letting your function implement the 'ResultTypeQueryable' interface.
I am trying to understand why I am getting this error, which I was not getting before, when the topology was consuming from a Kafka cluster using a KafkaConsumer. I found a workaround by providing the type info using the following:
.returns(TypeInformation.of(Envelope.class))
However, at runtime, after deserialization, input is set to null (obviously, as there is no custom deserialization method defined).
Questions:
Can someone please help me understand why I am getting the InvalidTypesException exception?
Why is MySource being serialized/deserialized? Is there a way I can avoid this while using MiniClusterWithClientResource?
I could hack some writeObject() and readObject() methods into MySource, but I would prefer to avoid that route. Is it possible to use some framework/class to test the topology without providing a Source (and Sink) that is Serializable? It would be great if I could use something like KeyedOneInputStreamOperatorTestHarness to which I could pass the topology, and avoid the whole serialization/deserialization step at the beginning.
Any ideas / pointers would be greatly appreciated.
Thank you,
Ahmed.
"why I am getting the InvalidTypesException exception?"
Not sure, usually I'd need to see the workflow definition to understand where the type information is getting dropped.
"Why if MySource being deserialized/serialized?"
Because Flink distributes operators to multiple tasks on multiple machines by serializing them, then sending over the network, and then deserializing.
"Is there a way I can void this while using MiniClusterWithClientResource?"
Yes. Since the MiniCluster runs in a single JVM, you can use a static ConcurrentLinkedQueue to hold all of the Envelope records, and your MySource just reads from this queue.
Nit: Your MySource should set a transient boolean running flag to true in the open() method, false in the cancel() method, and check it in the run() method's loop.
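As a rough sketch, the static-queue idea and the running flag could be combined like this, assuming the Envelope class from the question (the class and field names below are made up for illustration):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

public class QueueBackedSource extends RichParallelSourceFunction<Envelope> {

    // Static, so the records are never captured in the serialized source instance;
    // the test fills this queue before executing the job (works because the
    // MiniCluster runs everything in a single JVM).
    public static final Queue<Envelope> INPUT = new ConcurrentLinkedQueue<>();

    private transient volatile boolean running;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        running = true;
    }

    @Override
    public void run(SourceContext<Envelope> ctx) throws Exception {
        Envelope next;
        while (running && (next = INPUT.poll()) != null) {
            ctx.collect(next);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}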
Since version 1.15 of Apache Flink you can use the compaction feature to merge several files into one.
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/filesystem/#compaction
How can we use compaction with bulk Parquet format?
The existing implementations of RecordWiseFileCompactor.Reader (DecoderBasedReader and InputFormatBasedReader) do not seem suitable for Parquet.
Furthermore, we cannot find any example of compacting Parquet or other bulk formats.
There are two types of file compactor mentioned in Flink's documentation.
OutputStreamBasedFileCompactor : The users can write the compacted results into an output stream. This is useful when the users don’t want to or can’t read records from the input files.
RecordWiseFileCompactor : The compactor can read records one-by-one from the input files and write into the result file similar to the FileWriter.
If I remember correctly, Parquet saves its meta information at the end of the file, so we obviously need to use RecordWiseFileCompactor: we have to read a whole Parquet file to get the meta information at its end, and only then can we use that meta information (number of row groups, schema) to parse the file.
Looking at the Java API, to construct a RecordWiseFileCompactor we need an instance of RecordWiseFileCompactor.Reader.Factory.
There are two implementations of the RecordWiseFileCompactor.Reader.Factory interface: DecoderBasedReader.Factory and InputFormatBasedReader.Factory.
DecoderBasedReader.Factory creates a DecoderBasedReader instance, which reads the whole file content from an InputStream. We could load the bytes into a buffer and parse the file from the byte buffer, which is obviously painful, so we don't use this implementation.
InputFormatBasedReader.Factory creates an InputFormatBasedReader, which reads the whole file content using the FileInputFormat supplier we pass to the InputFormatBasedReader.Factory constructor.
The InputFormatBasedReader instance uses that FileInputFormat to read the file record by record, and passes the records to the writer we passed to the forBulkFormat call, until the end of the file.
The writer receives all the records and compacts them into one file.
So the question becomes: what is FileInputFormat and how do we implement it?
Though the FileInputFormat class has many methods and fields, we know from the InputFormatBasedReader source code mentioned above that only four of its methods are called from InputFormatBasedReader:
open(FileInputSplit fileSplit), which opens the file
reachedEnd(), which checks whether we have hit the end of the file
nextRecord(), which reads the next record from the opened file
close(), which cleans up
Luckily, there's an AvroParquetReader in the org.apache.parquet.avro package that we can utilize. It already implements open/read/close, so we can wrap the reader inside a FileInputFormat and let the AvroParquetReader do all the dirty work.
Here's an example code snippet:
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.InputFile;

import java.io.IOException;

public class ExampleFileInputFormat extends FileInputFormat<GenericRecord> {

    private ParquetReader<GenericRecord> parquetReader;
    private GenericRecord readRecord;

    @Override
    public void open(FileInputSplit split) throws IOException {
        Configuration config = new Configuration();
        // Set Hadoop config here.
        // For example, if you are using GCS, set fs.gs.impl here.
        // I haven't tried to use core-site.xml, but I believe that is feasible.
        InputFile inputFile = HadoopInputFile.fromPath(new org.apache.hadoop.fs.Path(split.getPath().toUri()), config);
        parquetReader = AvroParquetReader.<GenericRecord>builder(inputFile).build();
        readRecord = parquetReader.read();
    }

    @Override
    public void close() throws IOException {
        parquetReader.close();
    }

    @Override
    public boolean reachedEnd() throws IOException {
        return readRecord == null;
    }

    @Override
    public GenericRecord nextRecord(GenericRecord genericRecord) throws IOException {
        GenericRecord r = readRecord;
        readRecord = parquetReader.read();
        return r;
    }
}
Then you can use the ExampleFileInputFormat like below
FileSink<GenericRecord> sink = FileSink.forBulkFormat(
        new Path(path),
        AvroParquetWriters.forGenericRecord(schema))
    .withRollingPolicy(OnCheckpointRollingPolicy.build())
    .enableCompact(
        FileCompactStrategy.Builder.newBuilder()
            .enableCompactionOnCheckpoint(10)
            .build(),
        new RecordWiseFileCompactor<>(
            new InputFormatBasedReader.Factory<>(new SerializableSupplierWithException<FileInputFormat<GenericRecord>, IOException>() {
                @Override
                public FileInputFormat<GenericRecord> get() throws IOException {
                    FileInputFormat<GenericRecord> format = new ExampleFileInputFormat();
                    return format;
                }
            })
        ))
    .build();
I have successfully deployed this to Flink on Kubernetes and compacted files on GCS. Here are some notes on deploying.
You need to download the Flink shaded Hadoop jar from https://flink.apache.org/downloads.html (search for "Pre-bundled Hadoop" on the page) and put the jar into $FLINK_HOME/lib/.
If you are writing files to some object storage, for example GCS, you need to follow the plugin instructions. Remember to put the plugin jar into the plugins folder, not the lib folder.
If you are writing files to some object storage, you also need to download the connector jar from your cloud service provider. For example, I'm using GCS and downloaded the gcs-connector jar following the GCP instructions. Put the jar into some folder other than $FLINK_HOME/lib or $FLINK_HOME/plugins. I put the connector jar into a newly created folder, $FLINK_HOME/hadoop-lib.
Set the environment variable HADOOP_CLASSPATH=$FLINK_HOME/lib/YOUR_SHADED_HADOOP_JAR:$FLINK_HOME/hadoop-lib/YOUR_CONNECTOR_JAR
After all these steps, you can start your job and you're good to go.
The vulnerability scan system detects a CVE regarding RestEasy 3.7.0: CVE-2021-20289
https://nvd.nist.gov/vuln/detail/CVE-2021-20289, which states that RESTEasy should be upgraded to 4.6.0.Final or above. But here comes the question: RESTEasy 4 and above no longer contains this submodule.
I noticed that in https://developer.jboss.org/en/resteasy/blog/2019/03/28/resteasy-4-is-coming-soon, it is stated that
the big resteasy-jaxrs and resteasy-client modules have been split into resteasy-core-spi, resteasy-client-api, resteasy-core and resteasy-client, with the first and second ones to be considered as public modules, for which we're expected to retain backward compatibility till next major release.
If I comment out the resteasy-jaxrs dependency in pom.xml, I get an error: cannot access class org/jboss/resteasy/microprofile/config/ResteasyConfigFactory. But I cannot find that class in the resteasy-core-spi or resteasy-client-api module. The nearest match is resteasy-4.7.4.Final/resteasy-core-spi/src/main/java/org/jboss/resteasy/spi/config/ConfigurationFactory.java. But if the class name changed, there would not be an easy migration. Or am I missing something?
Actually, according to https://issues.redhat.com/browse/RESTEASY-2878, this CVE is fixed in 3.15.2. So I am lost.
In the end, I:
migrated from RESTEasy 3 to 4, abandoning resteasy-jaxrs and introducing resteasy-client-api and resteasy-client;
switched from org.jboss.resteasy.client.jaxrs.ResteasyClientBuilder to org.jboss.resteasy.client.jaxrs.internal.ResteasyClientBuilderImpl. Even though it's in an internal package, it's a public class and the Javadoc does not advise against using it directly. This implementation is quite standard and introduces minimal friction while migrating. I also compared the default values set in the class, such as connectionPoolSize and so on; they are the same as in resteasy-jaxrs 3.
The code change is minimal:
// before
private ResteasyClient client = new ResteasyClientBuilder()
        .connectionPoolSize(CONNECTION_POOL_SIZE)
        .build();

// after
private ResteasyClient client = new ResteasyClientBuilderImpl()
        .connectionPoolSize(CONNECTION_POOL_SIZE)
        .build();
And the provider:
I am receiving content type text/plain. In resteasy-jaxrs 3 I used ResteasyJackson2Provider, which implements MessageBodyReader and MessageBodyWriter, and it worked. Now, in RESTEasy 4, the content-type check seems to be stricter, and isReadable() of the same-named class only accepts a Content-Type that is null or contains "json". As I receive text/plain, it no longer works.
For reading plain text, I suggest using StringTextStar, a provider class present in RESTEasy 4.7.5, and it seems to work: it reads the input stream and writes it out as a string, which is just what I need. Check its implementation.
ResteasyClient client1 = new ResteasyClientBuilderImpl().build();
client1.register(new ResteasyJackson2Provider()); // for JSON

ResteasyClient client2 = new ResteasyClientBuilderImpl().build();
client2.register(new StringTextStar()); // for text/plain
And the auto-closeable client:
Now you need to use try-finally or try-with-resources to close it. It will be closed automatically if you don't, but you will receive a warning along the lines of "Closing an instance of ApacheHttpClient43Engine for you".
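For example, a minimal sketch of the try/finally pattern, reusing the builder call from above (CONNECTION_POOL_SIZE is the same placeholder constant):

ResteasyClient client = new ResteasyClientBuilderImpl()
        .connectionPoolSize(CONNECTION_POOL_SIZE)
        .build();
try {
    // ... use the client ...
} finally {
    // Closing explicitly avoids the "Closing an instance of ApacheHttpClient43Engine for you" warning.
    client.close();
}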
I am using Apache Flink to perform analytics on streaming data.
I am using a dependency whose object takes more than 10 seconds to create, as it reads several files present in HDFS before initialisation.
If I initialise the object in the open method I get a timeout exception, and if I do it in the constructor of a sink/flatmap I get a serialisation exception.
Currently I am using a static block to initialise the object in another class, calling Preconditions.checkNotNull(MGenerator.mGenerator) in the main method, and then it works when used in a flatmap or sink.
Is there a way to create a non-serializable dependency object, which might take more than 10 seconds to initialise, for use in Flink's flatmap or sink?
public class DependencyWrap {
    static MGenerator mGenerator;

    static {
        final String configStr = "{}";
        final Config config = new Gson().fromJson(configStr, Config.class);
        mGenerator = new MGenerator(config);
    }
}
public class MyStreaming {

    public static void main(String[] args) throws Exception {
        Preconditions.checkNotNull(MGenerator.mGenerator);

        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);
        ...

        input.flatMap(new RichFlatMapFunction<Map<String,Object>,List<String>>() {
            @Override
            public void open(Configuration parameters) {
            }

            @Override
            public void flatMap(Map<String,Object> value, Collector<List<String>> out) throws Exception {
                out.collect(MGenerator.mGenerator.generateMyResult(value.f0, value.f1));
            }
        });
    }
}
Also, please correct me if I am wrong about the question.
Doing it in the open() method is 100% the right way to do it. Is Flink giving you the timeout exception, or the object?
As a last-ditch method, you could wrap your object in a class that contains both the object and its JSON string or Config (is Config serializable?), with the object marked transient, and then override the readObject/writeObject methods to call the constructor. If the mGenerator object itself is stateless (and you'll have other problems if it's not), the serialization code should get called only once when jobs are distributed to the taskmanagers.
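A rough sketch of that wrapper idea, assuming Config can be rebuilt from its JSON string (all names below are illustrative, not from the original code); only readObject needs overriding here, since default serialization already covers the JSON string:

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;

import com.google.gson.Gson;

public class MGeneratorHolder implements Serializable {

    private final String configJson;           // small, serializable state
    private transient MGenerator mGenerator;   // heavy, non-serializable object

    public MGeneratorHolder(String configJson) {
        this.configJson = configJson;
        this.mGenerator = build(configJson);
    }

    public MGenerator get() {
        return mGenerator;
    }

    private static MGenerator build(String json) {
        Config config = new Gson().fromJson(json, Config.class);
        return new MGenerator(config);         // the >10s initialisation happens here
    }

    // Rebuild the transient object after default deserialization on the task manager.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        this.mGenerator = build(configJson);
    }
}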
Using open is usually the right place to load external lookup sources. The timeout is a bit odd; maybe there is a configuration around it.
However, if it's huge, using a static loader (either a static class as you did, or a singleton) has the benefit that you only need to load it once for all parallel instances of the task on the same task manager. Hence, you save memory and CPU time. This is especially true for you, as you use the same data structure in two separate tasks. Further, the static loader can be lazily initialized when it's used for the first time, to avoid the timeout in open.
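A minimal sketch of such a lazily initialized static loader, reusing the MGenerator/Config classes from the question (the helper class itself is made up):

import com.google.gson.Gson;

public final class MGeneratorLoader {

    private static volatile MGenerator instance;

    private MGeneratorLoader() {}

    // Built at most once per JVM, i.e. once per task manager, and only when
    // first requested, so open() itself stays fast.
    public static MGenerator get(String configJson) {
        if (instance == null) {
            synchronized (MGeneratorLoader.class) {
                if (instance == null) {
                    Config config = new Gson().fromJson(configJson, Config.class);
                    instance = new MGenerator(config);
                }
            }
        }
        return instance;
    }
}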
The clear downside of this approach is that the testability of your code suffers. There are some ways around that, which I could expand if there is interest.
I don't see a benefit of using the proxy serializer pattern. It's unnecessarily complex (custom serialization in Java) and offers little benefit.
I'm trying to deserialize an object, which:
was created and serialized in another standard JVM (server)
implements the traditional Java Externalizable interface
was passed over a network
public static void getData() {
    ConnectionRequest req = new ConnectionRequest() {
        protected void readResponse(InputStream is) throws IOException {
            DataInputStream dis = new DataInputStream(is);
            Employee recovered = new Employee();
            recovered.internalize(1, dis);
        }
    };
    req.setUrl(BASEURL);
    req.setPost(false);
    NetworkManager.getInstance().addToQueueAndWait(req);
}
From the remote JVM I'm passing the object as a byte array or ByteArrayInputStream, and in CN1 I get an EOFException.
Is it possible to transfer objects this way, or should I use JSON?
I thought I wouldn't need JSON if I have Java on both sides...
Codename One's externalization interface isn't compatible with Java SE's. Java SE serialization and externalization rely on reflection and dynamic invocation, which aren't practical on all of Codename One's targets (even Android, where the binary is usually obfuscated).
You can pass an object, however you will need to use the Codename One API to do so. You can effectively take the JavaSE.jar file from the Codename One project and use the API there to write/read the object.
Other than that, your code to read the object is incorrect: you should use Util.readObject/writeObject. I suggest reading the great tutorial Steve Hannah wrote on the subject.
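A minimal sketch of what the reading side could look like with the Codename One API, assuming Employee implements com.codename1.io.Externalizable and is registered under the ID "Employee" on both ends (the surrounding class and URL handling are illustrative):

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

import com.codename1.io.ConnectionRequest;
import com.codename1.io.NetworkManager;
import com.codename1.io.Util;

public class EmployeeClient {
    public static void getData(String baseUrl) {
        // Register the class so Util.readObject can map the object ID in the
        // stream back to the Employee class.
        Util.register("Employee", Employee.class);

        ConnectionRequest req = new ConnectionRequest() {
            @Override
            protected void readResponse(InputStream is) throws IOException {
                DataInputStream dis = new DataInputStream(is);
                Employee recovered = (Employee) Util.readObject(dis);
                // ... use recovered ...
            }
        };
        req.setUrl(baseUrl);
        req.setPost(false);
        NetworkManager.getInstance().addToQueueAndWait(req);
    }
}

On the server side the object would need to be written with the matching Util.writeObject from the Codename One JavaSE.jar, as described above.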