How to convert RowData into Row when using DynamicTableSink - apache-flink

I have a question regarding the new source/sink interfaces in Flink. I am currently implementing a new custom DynamicTableSinkFactory, DynamicTableSink, SinkFunction and OutputFormat. I use the JDBC connector as an example, and I use Scala.
All data that is fed into the sink has the type Row, so the OutputFormat serialisation is based on the Row interface:
override def writeRecord(record: Row): Unit = {...}
As stated in the documentation:
records must be accepted as org.apache.flink.table.data.RowData. The framework provides runtime converters such that a sink can still work on common data structures and perform a conversion at the beginning.
The goal here is to keep the Row data structure and only convert between Row and RowData at the point where records enter the SinkFunction, so that the rest of the code does not need to be changed.
class MySinkFunction(outputFormat: MyOutputFormat) extends RichSinkFunction[RowData] with CheckpointedFunction
So the resulting question is: how do I convert RowData into Row when using a DynamicTableSink and OutputFormat? Where should the conversion happen?
links:
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sourceSinks.html
https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-jdbc/src/test/java/org/apache/flink/connector/jdbc
Thanks.

You can obtain a converter instance in the Context provided in org.apache.flink.table.connector.sink.DynamicTableSink#getSinkRuntimeProvider.
// create type information for the DeserializationSchema
final TypeInformation<RowData> producedTypeInfo =
        context.createTypeInformation(producedDataType);

// most of the code in DeserializationSchema will not work on internal data structures
// create a converter for conversion at the end
final DataStructureConverter converter =
        context.createDataStructureConverter(producedDataType);
The instance is Java serializable and can be passed into the sink function. You should also call the converter.open() method in your sink function.
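Putting the pieces together, here is a minimal sketch in Java of where the conversion can happen. MySinkFunction and MyOutputFormat are the classes from the question; the consumedDataType field and the constructor wiring are assumptions for illustration:

// In the DynamicTableSink: create the converter and pass it to the sink function.
@Override
public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
    DataStructureConverter converter =
            context.createDataStructureConverter(consumedDataType);
    return SinkFunctionProvider.of(new MySinkFunction(converter, outputFormat));
}

// In the sink function: open the converter once, then convert each record.
class MySinkFunction extends RichSinkFunction<RowData> {
    private final DynamicTableSink.DataStructureConverter converter;
    private final MyOutputFormat outputFormat;

    MySinkFunction(DynamicTableSink.DataStructureConverter converter,
                   MyOutputFormat outputFormat) {
        this.converter = converter;
        this.outputFormat = outputFormat;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        converter.open(RuntimeConverter.Context.create(
                getRuntimeContext().getUserCodeClassLoader()));
    }

    @Override
    public void invoke(RowData value, Context context) throws Exception {
        // toExternal performs the internal-to-external (RowData -> Row) conversion
        Row row = (Row) converter.toExternal(value);
        outputFormat.writeRecord(row);
    }
}

This way writeRecord(record: Row) and the rest of the OutputFormat code can stay unchanged.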
A more complex example can be found here (for sources but sinks work in a similar way). Have a look at SocketDynamicTableSource and ChangelogCsvFormat in the same package.

Related

Flink SQL - How to parse a TIMESTAMP with custom pattern?

From the documentation it looks like Flink's SQL can only parse timestamps in a certain format, namely:
TIMESTAMP string: Parses a timestamp string in the form "yy-mm-dd hh:mm:ss.fff" to a SQL timestamp.
Is there any way to pass in a custom DateTimeFormatter to parse a different kind of timestamp format?
You can implement any parsing logic using a user-defined scalar function (UDF).
This would look in Scala as follows.
import java.sql.Timestamp
import org.apache.flink.table.functions.ScalarFunction

class TsParser extends ScalarFunction {
  def eval(s: String): Timestamp = {
    // your logic
  }
}
Once defined, the function has to be registered at the TableEnvironment:
tableEnv.registerFunction("tsParser", new TsParser())
Now you can use the function tsParser just like any built-in function.
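For example, assuming a registered table MyTable with a string column ts_string (both names are made up here):

// on older Flink versions the method is tableEnv.sql(...) instead of sqlQuery(...)
Table result = tableEnv.sqlQuery("SELECT tsParser(ts_string) AS ts FROM MyTable");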
See the documentation for details.

Registering an Aggregate UDF in Apache Flink

I am trying to follow the steps here to create a basic Flink Aggregate UDF. I've added the dependencies and implemented
public class MyAggregate extends AggregateFunction<Long, TestAgg> {..}
I've implemented the mandatory methods as well as a few others: accumulate, merge, etc. All this builds without errors. Now according to the docs, I should be able to register this as
StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment sTableEnv = StreamTableEnvironment.getTableEnvironment(sEnv);
sTableEnv.registerFunction("MyMin", new MyAggregate());
But registerFunction seems to want only a ScalarFunction as input. I am getting an incompatible type error: The method registerFunction(String, ScalarFunction) in the type TableEnvironment is not applicable for the arguments (String, MyAggregate)
Any help would be great.
You need to import the StreamTableEnvironment for your chosen language which is in your case org.apache.flink.table.api.java.StreamTableEnvironment.
org.apache.flink.table.api.StreamTableEnvironment is a common abstract class for the Java and Scala variants of StreamTableEnvironment. We've noticed that this part of the API is confusing for users and we will improve it in the future.
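Concretely, a sketch of the registration with the Java-specific import (mirroring the question's code, which targets the pre-1.9 Table API):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
// the Java-specific variant, not org.apache.flink.table.api.StreamTableEnvironment
import org.apache.flink.table.api.java.StreamTableEnvironment;

StreamExecutionEnvironment sEnv = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment sTableEnv = StreamTableEnvironment.getTableEnvironment(sEnv);
// this now resolves to the overload that accepts an AggregateFunction
sTableEnv.registerFunction("MyMin", new MyAggregate());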

JPA map entity with array datatype

I have a table which contains a column of type: integer[]
I'm trying to map my entity to this table, and I've tried the following suggestion:
@ElementCollection
private ArrayList<Integer> col;

public MyEntity() {
    col = new ArrayList<>();
}
However I get the following error: Illegal attempt to map a non collection as a @OneToMany, @ManyToMany or @CollectionOfElements
Not sure how to get around this. I'm open to changing the entity's datatype, but I would prefer not to move this property into its own table/entity. Is there another solution? Thanks.
The field must be of type List<Integer>, not ArrayList<Integer>.
The JPA engine must be able to use its own List implementation, used for lazy-loading, dirty checking, etc.
It's a good idea in general to program against interfaces rather than implementations, and in JPA entities it's a requirement.
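A minimal sketch of the corrected mapping (the @Id field is added here only to make the entity valid; entity and field names come from the question):

import java.util.ArrayList;
import java.util.List;
import javax.persistence.ElementCollection;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

@Entity
public class MyEntity {

    @Id
    @GeneratedValue
    private Long id;

    // declare the interface type; the JPA provider substitutes its own implementation
    @ElementCollection
    private List<Integer> col = new ArrayList<>();
}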

Version tolerant XML Serialization

Our C# project needs to persist ~POCO objects to file. But we are at an early stage and changes occur quite often. Our software is already in use (with persisted data) by a few customers.
I prefer to use XML over anything for many reasons.
I checked many, many XML serialization libs:
Many libs store the specific type and version. I don't need that.
Many libs do not let us serialize by ourselves, i.e. through an interface for custom load/save of the data (I see many advantages **).
Some libs force us to have an empty constructor.
Some libs only manage public properties.
Some libs have many limitations on types (no support for Dictionary, …).
** (advantages of an interface to load/save data)
Easier to manage many versions.
Enables hardcoded conversion if required (class x -> class y, …).
Easier to drop old code.
I strongly think that for my needs we would be better served by the old way, a bit like deserializing in C++: something that would enable us to just add fields and field names manually instead of using attributes.
Kind of:
void XmlDeserialize(XmlReader xmlReader)
{
    xmlReader.Load((n) => Version(n)); // or just: _version = xmlReader.LoadInt("Version");
    xmlReader.Load((n) => Name(n));
    xmlReader.Load((n) => EmployeeId(n));
    if (Version == 2)
        …
    if (version == 3)
        …
}
The closest I have found to fit my needs was DataContractSerializer, which supports IExtensibleDataObject, but it is a pain to use.
Am I wrong everywhere? I can't be the only one with this need (or this vision). Why is nobody writing a lib for this, and did I miss something somewhere?
Where is my thinking wrong? What do you recommend?
Do you have to use XmlReader.Load for this? It is WAY easier to create the business objects that represent your XML data and then deserialize the object, like below (sorry, I only found my VB.NET version of this):
Public Shared Function ReadFromString(ByVal theString As String, ByVal encoding As System.Text.Encoding, ByVal prohibitDTD As Boolean) As T
    Dim theReturn As T = Nothing
    Dim s As System.Xml.Serialization.XmlSerializer
    s = New System.Xml.Serialization.XmlSerializer(GetType(T))
    Dim theBytes As Byte() = encoding.GetBytes(theString)
    Using ms As New IO.MemoryStream(theBytes)
        Using sTr As New StreamReader(ms, encoding)
            Dim sttng As New XmlReaderSettings
            'sttng.ProhibitDtd = prohibitDTD
            If Not prohibitDTD Then
                sttng.DtdProcessing = DtdProcessing.Ignore
                sttng.XmlResolver = Nothing
            Else
                sttng.DtdProcessing = DtdProcessing.Prohibit
            End If
            Using r As XmlReader = XmlReader.Create(sTr, sttng)
                theReturn = CType(s.Deserialize(r), T)
            End Using
        End Using
    End Using
    Return theReturn
End Function
You can even get rid of the XmlReaderSettings and the encoding if you like. But this way you can keep different business objects for each version you have. Additionally, if you're only adding (and not changing/deleting) objects, you can still use the most recent business object for all versions and just ignore the missing fields.
I finally decided to use XmlSerialization like this, but I hate being forced to create a default constructor and not being able to serialize non-public members.
I also decided to use ProtoContract when very high speed is necessary.
But my preferred one is DataContractSerializer: it offers an XML format (easier to debug), needs no default constructor, and can serialize any member.

Designing a generic DB utility class

Time and again I find myself creating a database utility class which has multiple functions which all do almost the same thing but treat the result set slightly differently.
For example, consider a Java class which has many functions which all look like this:
public void doSomeDatabaseOperation() throws SQLException {
    Connection con = DriverManager.getConnection("jdbc:mydriver", "user", "pass");
    try {
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT whatever FROM table"); // query will be different each time
        while (rs.next()) {
            // handle result set - differently each time
        }
    } catch (Exception e) {
        // handle
    } finally {
        con.close();
    }
}
Now imagine a class with 20 of these functions.
As you can see, there is tons of boilerplate (opening a connection, the try-finally block), and the only things that change are the query and the way you handle the result set. This type of code occurs in many languages (assuming you're not using an ORM).
How do you manage your DB utility classes so as to reduce code duplication? What does a typical DB utility class look like in your language/framework?
The way I did this in one of my projects was to follow what Spring does with its JdbcTemplate and come up with a Query framework. Basically, create a common class which can take a select statement or PL/SQL call plus bind parameters. If the query returns a result set, also pass a RowMapper. This RowMapper object will be called by the framework to convert each row into an object of any kind.
Example -
Query execute = new Query("{any select or pl/sql}",
        // Inputs and Outputs are for bind variables.
        new SQL.Inputs(Integer.class, ...),
        // Outputs is only meaningful for PL/SQL since the
        // ResultSetMetaData should be used to obtain queried columns.
        new SQL.Outputs(String.class));
If you want the RowMapper:
Query execute = new Query("{any select or pl/sql}",
        // Inputs and Outputs are for bind variables.
        new SQL.Inputs(Integer.class, ...),
        // Outputs is only meaningful for PL/SQL since the
        // ResultSetMetaData should be used to obtain queried columns.
        new SQL.Outputs(String.class),
        new RowMapper() {
            public Object mapRow(ResultSet rs, int rowNum) throws SQLException {
                Actor actor = new Actor();
                actor.setFirstName(rs.getString("first_name"));
                actor.setSurname(rs.getString("surname"));
                return actor;
            }
        });
Finally, a Row class is the output, which will hold the list of objects if you have passed a RowMapper:
for (Row r : execute.query(conn, id)) {
    // handle the rows
}
You can go fancy and use generics so that type safety is guaranteed.
Sounds like you could make use of the Template Method pattern here. It would allow you to define the common steps (and default implementations of them, where applicable) that all subclasses take to perform the action. Then subclasses need only override the steps which differ: the SQL query, the DB-field-to-object-field mapping, etc.
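A minimal sketch of that idea (all class and method names here are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Template Method: the abstract base class owns the connection boilerplate,
// subclasses supply only the query and the per-row mapping.
public abstract class DbTemplate<T> {

    protected abstract String query();

    protected abstract T mapRow(ResultSet rs) throws SQLException;

    public List<T> execute() throws SQLException {
        List<T> results = new ArrayList<>();
        // try-with-resources replaces the manual try-finally/close() boilerplate
        try (Connection con = DriverManager.getConnection("jdbc:mydriver", "user", "pass");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(query())) {
            while (rs.next()) {
                results.add(mapRow(rs));
            }
        }
        return results;
    }
}

Each of the 20 near-identical functions then shrinks to a subclass that implements only query() and mapRow().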
When using .NET, the Data Access Application Block is in fairly widespread use, providing support for the following:
The [data access] application block was designed to achieve the following goals:
Encapsulate the logic used to perform the most common data access tasks.
Eliminate common coding errors, such as failing to close connections.
Relieve developers of the need to write duplicated code for common data access tasks.
Reduce the need for custom code.
Incorporate best practices for data access, as described in the .NET Data Access Architecture Guide.
Ensure that, as far as possible, the application block functions work with different types of databases.
Ensure that applications written for one type of database are, in terms of data access, the same as applications written for another type of database.
There are plenty of examples and tutorials of its usage too: a Google search will find msdn.microsoft.com, 4guysfromrolla.com, codersource.com and others.
