Using Flink with thrift - apache-flink

I'm seeing some logs within my flink app with respect to my thrift classes:
2020-06-01 14:31:28 INFO TypeExtractor:1885 - Class class com.test.TestStruct contains custom serialization methods we do not call, so it cannot be used as a POJO type and must be processed as GenericType. Please read the Flink documentation on "Data Types & Serialization" for details of the effect on performance.
So I followed the instructions here:
https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html#apache-thrift-via-kryo
And I did that for the Thrift-generated TestStruct along with all the Thrift structs nested within it (I've skipped over named types, though).
Also the thrift code that got generated is in Java whereas the flink app is written using scala.
How would I make that message disappear? I ask because I'm also hitting another bug: when I convert my DataStream into that TestStruct, some fields end up missing. I suspect this is due to serialization issues?

Actually, as of now, you can't get rid of this warning, but it is also not a problem for the following reason:
The warning basically just says that Flink's type system is not using any of its internal serializers but will instead treat the type as a "generic type" which means, it is serialized via Kryo. If you followed my blog post on this, this is exactly what you want: use Kryo to serialize via Thrift. You could use a debugger to set a breakpoint into TBaseSerializer to verify that Thrift is being used.
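For reference, the registration described in that blog post looks roughly like the sketch below. TestStruct stands in for your Thrift-generated class, TBaseSerializer comes from the com.twitter:chill-thrift dependency, and the surrounding class is just scaffolding for illustration:
import com.test.TestStruct;
import com.twitter.chill.thrift.TBaseSerializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ThriftKryoSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Tell Kryo to serialize TestStruct via Thrift's own serialization;
        // repeat the call for every Thrift type that flows between operators.
        env.getConfig().addDefaultKryoSerializer(TestStruct.class, TBaseSerializer.class);
        // ... build and execute your pipeline here ...
    }
}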
As for the missing fields, I would suspect that this happens during the conversion into your TestStruct in your (flat)map operator and maybe not in the serialization that is used to pass this struct to the next operator. You should verify where these fields go missing - if you can reproduce it, a breakpoint in the debugger of your favourite IDE should help you find the cause.

Related

Is it better to use Row or GenericRowData with DataStream API?

I am working with Flink 1.15.2. Should I use Row or GenericRowData (which inherits from RowData) for my own data type? I mostly use the streaming API.
Thanks.
Sig.
In general the DataStream API is very flexible when it comes to record types. POJO types might be the most convenient ones. Basically any Java class can be used but you need to check which TypeInformation is extracted via reflection. Sometimes it is necessary to manually overwrite it.
For Row you will always have to provide the types manually as reflection cannot do much based on class signatures.
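For example, a minimal sketch of providing the Row type manually in the DataStream API (the field names "name" and "len" and the input values are made up for illustration):
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.types.Row;

public class RowTypeInfoExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "bb", "ccc")
            .map(s -> Row.of(s, s.length()))
            // Reflection cannot infer the Row's field types from the lambda,
            // so they are declared explicitly here.
            .returns(Types.ROW_NAMED(new String[] {"name", "len"}, Types.STRING, Types.INT))
            .print();

        env.execute("row-type-info-example");
    }
}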
GenericRowData should be avoided; it is rather an internal class with many caveats (strings must be StringData, and array handling is not straightforward). Also, GenericRowData becomes BinaryRowData after deserialization. TL;DR: this type is meant for the SQL engine.
The docs are actually helpful here, I was confused too.
The section at the top titled "All Known Implementing Classes" lists all the implementations. RowData and GenericRowData are described as internal data structures. If you can use a POJO, then great. But if you need something that implements RowData, take a look at BinaryRowData, BoxedWrapperRowData, ColumnarRowData, NestedRowData, or any of the implementations there that aren't listed as internal.
I'm personally using NestedRowData to map a DataStream[Row] into a DataStream[RowData], and I'm not at all sure that's a good idea :) Especially since I can't seem to add a string attribute.

GAE Datastore "java.lang.IllegalArgumentException: Property `${property}' contains an invalid nested entity."

I started receiving an error over the past couple of days when persisting a nested map structure as an embedded entity. It was working earlier without any problem.
java.lang.IllegalArgumentException: Property metrics contains an invalid nested entity.
at com.google.appengine.api.datastore.DatastoreApiHelper.translateError(DatastoreApiHelper.java:49)
at com.google.appengine.api.datastore.DatastoreApiHelper$1.convertException(DatastoreApiHelper.java:127)
at com.google.appengine.api.utils.FutureWrapper.get(FutureWrapper.java:97)
at com.google.appengine.api.datastore.Batcher$ReorderingMultiFuture.get(Batcher.java:115)
at com.google.appengine.api.datastore.FutureHelper$TxnAwareFuture.get(FutureHelper.java:171)
at com.googlecode.objectify.cache.TriggerFuture.get(TriggerFuture.java:100)
at com.googlecode.objectify.impl.ResultAdapter.now(ResultAdapter.java:34)
Also, that property is already unindexed, so technically the 1500-byte limit should not apply. I think they made some changes to restrict this.
This error is not documented anywhere.
Keys with dots cannot be used in embedded maps because they have a special meaning in the new API. If you must have dots in your keys, you can escape them using an @Stringify Stringifier.
Consequently, if you insist on using embedded maps, sanitize your keys to prevent such exceptions.
Found the reason here: https://github.com/objectify/objectify/wiki/UpgradeVersion5ToVersion6
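If you'd rather sanitize the keys than use @Stringify, a minimal sketch could look like this (the "__dot__" placeholder is an arbitrary choice for illustration; pick something that cannot appear in your real keys):
import java.util.HashMap;
import java.util.Map;

public final class EmbeddedMapKeys {
    private EmbeddedMapKeys() {}

    // Escape dots before persisting the map as an embedded entity.
    public static Map<String, Object> escapeKeys(Map<String, Object> metrics) {
        Map<String, Object> safe = new HashMap<>();
        for (Map.Entry<String, Object> e : metrics.entrySet()) {
            safe.put(e.getKey().replace(".", "__dot__"), e.getValue());
        }
        return safe;
    }

    // Restore the original keys after loading the entity.
    public static Map<String, Object> unescapeKeys(Map<String, Object> stored) {
        Map<String, Object> original = new HashMap<>();
        for (Map.Entry<String, Object> e : stored.entrySet()) {
            original.put(e.getKey().replace("__dot__", "."), e.getValue());
        }
        return original;
    }
}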
One reason for this error could be that you are directly sending protobufs and serialized some bytes that are simply not a valid entity.

How to implement the class `MyTupleReducer` from the Flink official documentation

I'm learning the Flink documentation's DataSet API.
There's a class called MyTupleReducer.
I'm trying to complete it:
https://paste.ubuntu.com/p/3CjphGQrXP/
but it's full of red lines in IntelliJ.
Could you show me the correct way to write the above code?
Thanks for your help~!
PS:
I'm writing part of MyTupleReducer:
https://pastebin.ubuntu.com/p/m4rjs6t8QP/
but the return part is wrong.
I suspect that importing the Reduce from akka has thrown you off course. Flink's ReduceFunction<T> interface needs you to implement a reduce method with this signature:
T reduce(T value1, T value2) throws Exception;
This is a classic reduce that takes two objects of type T and produces a third object of the same type.
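A minimal sketch of what MyTupleReducer might look like, assuming (as in the documentation's word-count-style examples) that it reduces Tuple2<String, Integer> records by keeping the key and summing the counts:
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;

public class MyTupleReducer implements ReduceFunction<Tuple2<String, Integer>> {
    @Override
    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1,
                                          Tuple2<String, Integer> value2) {
        // Same type in, same type out: keep the key, sum the counts.
        return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
    }
}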

apache flink - the correct way of error handling

I wonder if there is an option of built in error handling in Flink.
There may be 2 cases:
the current message from Kafka (in my case) is invalid - skip it and continue to the next one
uncaught exception - from what I saw it can stop the stream aggregation completely.
How can I handle these 2 cases? (Java code)
1) This is done idiomatically with a flatMap: if your message is valid, you go on with a list containing your valid element (maybe already processed in the same step). If it's not valid, you simply return an empty list so that no elements are produced by that step. I could provide Scala code but I'm not familiar with Java APIs so I don't want to put you off track. Just check the flatMap call.
2) This depends on the type of exception: if it's provoked by your own code, just catch it and handle it inside the operator, or simply log it and move on. Without any further information about a specific case, this is the best I know of, but again, coming from Scala I haven't experienced runtime exceptions.
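Since the question asked for Java, here is a hedged Java sketch of case 1 (the invalid-message case); parsing a String message into an Integer is just a stand-in for your real record type and validation:
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;

public class SafeIntParser implements FlatMapFunction<String, Integer> {
    @Override
    public void flatMap(String raw, Collector<Integer> out) {
        try {
            // Valid message: emit exactly one element downstream.
            out.collect(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            // Invalid message: emit nothing (optionally log it) and let the
            // stream continue with the next record instead of failing the job.
        }
    }
}
It would be attached to the stream with something like stream.flatMap(new SafeIntParser()).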

Source code compatibility between java 7 & 8 for overloaded functions

I have created a single jar for Java 7 & 8 for a JDBC driver (using -source/-target compile options). However, I am having difficulty compiling applications that use the new/overloaded methods in the ResultSet interface:
//New in Java 8
updateObject(int columnIndex, Object x, SQLType targetSqlType)
// Available in Java 7
updateObject(int columnIndex, Object x, int targetSqlType)
Note that SQLType is a new interface introduced in Java 8.
I have compiled the driver using Java 8, which worked fine. However, when any application using the driver accesses the method updateObject(int, Object, int) from Java 7, it gets a compilation error saying “class file for java.sql.SQLType not found”, although the application is not using SQLType. I think this is because Java looks at all the overloaded methods to determine the most specific one, and when doing so it cannot access the new updateObject method from Java 8 (as SQLType is not defined in Java 7). Any idea how I can resolve this issue?
Note that the updateObject method has a default implementation in the ResultSet interface in Java 8, so I cannot even use a more generic type instead of SQLType in the new method. In that case any application that uses the new method gets a compilation error saying updateObject is ambiguous.
You can't use something compiled with Java 8 (for instance) in a lower version (say Java 7); you will get something like Unsupported major.minor version.... You need two JARs, one for version 1.7 and one for version 1.8. Naturally, the 1.7 one can't reference SQLType if it's not supported on that JDK; on the other hand, you are encouraged to keep the overloaded version when you build the 1.8 JAR.
Note that this has nothing to do with backwards compatibility.
In this case, I would call it the application’s fault. After all, your class is implementing the ResultSet interface and applications using JDBC should be compiled against that interface instead of your implementation class.
If a Java 7 application is compiled under Java 7 (where SQLType does not exist) against the Java 7 version of the ResultSet interface, there should be no problems as that interface doesn’t have that offending updateObject method and it doesn’t matter which additional methods an implementation class has. If done correctly, the compiler shouldn’t even know that the implementation type will be your specific class.
You may enforce correct usage by declaring the methods of your Statement implementation class to return ResultSet instead of a more specific type. The same applies to the Connection returned by the Driver and the Statement returned by the Connection. It’s tempting to use covariant return types to declare your specific implementation classes but whenever your methods declare the interface type instead, you are guiding the application programmers to an interface based usage avoiding the problems described in your question.
The application programmer still may use a type cast, or better unwrap, to access custom features of your implementation if required. But then it’s explicit and the programmer knows what potential problems may occur (and how to avoid them).
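A small sketch of the interface-based usage described above (the JDBC URL and query are placeholders); the point is that the application compiles against java.sql.ResultSet only, so javac never needs to see java.sql.SQLType even when building under Java 7:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InterfaceBasedUsage {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; the driver's implementation classes are never named here.
        try (Connection conn = DriverManager.getConnection("jdbc:mydriver://localhost/db");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM t")) {
            while (rs.next()) {
                // Only methods declared on the Java 7 ResultSet interface are
                // resolved, so the Java 8-only updateObject overload never matters.
                System.out.println(rs.getInt(1) + " " + rs.getString(2));
            }
        }
    }
}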

Resources