Error converting Flink DataStream to Table after a ProcessFunction call - apache-flink

I'm implementing a data analysis pipeline in Flink and I have a problem converting a DataStream to a Table. I have this table defined from a join between two Kafka sources:
Table legalFileEventsTable = legalFilesTable.join(eventsTable)
    .where($("id").isEqual($("id_fascicolo")))
    .select(
        $("id").as("id_fascicolo"),
        $("id_evento"),
        $("giudice"),
        $("nrg"),
        $("codice_oggetto"),
        $("ufficio"),
        $("sezione"),
        $("data_evento"),
        $("evento"),
        $("data_registrazione_evento")
    );
Then I convert the joined table to a DataStream to apply some computation on the data. Here's the code I'm using:
DataStream<Row> phasesDurationsDataStream = tEnv.toChangelogStream(legalFileEventsTable)
.keyBy(r -> r.<Long>getFieldAs("id_fascicolo"))
.process(new PhaseDurationCounterProcessFunction());
phasesDurationsDataStream.print();
The PhaseDurationCounterProcessFunction emits a Row like this:
Row outputRow = Row.withNames(RowKind.INSERT);
outputRow.setField("id_fascicolo", currentState.getId_fascicolo());
outputRow.setField("nrg", currentState.getNrg());
outputRow.setField("giudice", currentState.getGiudice());
outputRow.setField("codice_oggetto", currentState.getCodice_oggetto());
outputRow.setField("ufficio", currentState.getUfficio());
outputRow.setField("sezione", currentState.getSezione());
outputRow.setField("fase", currentState.getPhase());
outputRow.setField("fase_completata", false);
outputRow.setField("durata", currentState.getDurationCounter());
out.collect(outputRow);
After collecting the results from the process function I reconvert the DataStream to a Table and execute the pipeline:
Table phasesDurationsTable = tEnv.fromChangelogStream(
    phasesDurationsDataStream,
    Schema.newBuilder()
        .column("id_fascicolo", DataTypes.BIGINT())
        .column("nrg", DataTypes.STRING())
        .column("giudice", DataTypes.STRING())
        .column("codice_oggetto", DataTypes.STRING())
        .column("ufficio", DataTypes.STRING())
        .column("sezione", DataTypes.STRING())
        .column("fase", DataTypes.STRING())
        .column("fase_completata", DataTypes.BOOLEAN())
        .column("durata", DataTypes.BIGINT())
        .primaryKey("id_fascicolo", "fase")
        .build(),
    ChangelogMode.upsert()
);
env.execute();
But during startup I receive this exception:
Unable to find a field named 'id_fascicolo' in the physical data type derived from the given type information for schema declaration. Make sure that the type information is not a generic raw type. Currently available fields are: [f0]
It seems that the row information (field names and types) isn't available at that point, which is what triggers the exception. I tried invoking env.execute() before the DataStream-to-Table conversion; in that case the job starts, but I get no output when I print phasesDurationsTable. Any suggestions on how to make this work?

You need to specify the correct type information for Flink because it cannot figure out the schema from the generic Row type.
Here is an example. Say our data stream is producing records like this:
Row exampleRow = Row.ofKind(RowKind.INSERT, "sensor-1", 32L);
We need to define the matching type information as follows:
TypeInformation<?>[] types = {
    BasicTypeInfo.STRING_TYPE_INFO,
    BasicTypeInfo.LONG_TYPE_INFO
};
Using this we can define the row type information:
RowTypeInfo rowTypeInfo = new RowTypeInfo(
types,
new String[]{"sensor_name", "temperature"}
);
The last step is to specify the return type of this DataStream:
DataStream<Row> stream = env.fromSource(...).returns(rowTypeInfo);
Note that when using fromChangelogStream you only need to provide a schema if you want one that differs from the type information carried by the DataStream, so tEnv.fromChangelogStream(stream) works just fine and will use sensor_name and temperature as the column names by default.
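Applied to the pipeline in the question, a sketch of that fix might look like the following (untested; the field types are assumptions derived from the schema declared in the question):

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.types.Row;

// Type information matching the Row emitted by PhaseDurationCounterProcessFunction.
TypeInformation<?>[] rowTypes = {
    BasicTypeInfo.LONG_TYPE_INFO,    // id_fascicolo
    BasicTypeInfo.STRING_TYPE_INFO,  // nrg
    BasicTypeInfo.STRING_TYPE_INFO,  // giudice
    BasicTypeInfo.STRING_TYPE_INFO,  // codice_oggetto
    BasicTypeInfo.STRING_TYPE_INFO,  // ufficio
    BasicTypeInfo.STRING_TYPE_INFO,  // sezione
    BasicTypeInfo.STRING_TYPE_INFO,  // fase
    BasicTypeInfo.BOOLEAN_TYPE_INFO, // fase_completata
    BasicTypeInfo.LONG_TYPE_INFO     // durata
};
String[] rowNames = {"id_fascicolo", "nrg", "giudice", "codice_oggetto",
    "ufficio", "sezione", "fase", "fase_completata", "durata"};

// Declare the output type of the process function so the field names and types
// are not lost behind a generic raw Row type.
DataStream<Row> phasesDurationsDataStream = tEnv.toChangelogStream(legalFileEventsTable)
    .keyBy(r -> r.<Long>getFieldAs("id_fascicolo"))
    .process(new PhaseDurationCounterProcessFunction())
    .returns(new RowTypeInfo(rowTypes, rowNames));

With the names and types attached to the stream, the fromChangelogStream call with the upsert schema from the question should be able to resolve id_fascicolo and the other columns.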

Related

Finding missing records from 2 data sources with Flink

I have two data sources - an S3 bucket and a postgres database table. Both sources have records in the same format with a unique identifier of type uuid. Some of the records present in the S3 bucket are not part of the postgres table and the intent is to find those missing records. The data is bounded as it is partitioned by every day in the s3 bucket.
Reading the s3-source (I believe this operation is reading the data in batch mode since I am not providing the monitorContinuously() argument) -
final FileSource<GenericRecord> source = FileSource.forRecordStreamFormat(
AvroParquetReaders.forGenericRecord(schema), path).build();
final DataStream<GenericRecord> avroStream = env.fromSource(
source, WatermarkStrategy.noWatermarks(), "s3-source");
DataStream<Row> s3Stream = avroStream.map(x -> Row.of(x.get("uuid").toString()))
.returns(Types.ROW_NAMED(new String[] {"uuid"}, Types.STRING));
Table s3table = tableEnv.fromDataStream(s3Stream);
tableEnv.createTemporaryView("s3table", s3table);
For reading from Postgres, I created a postgres catalog -
PostgresCatalog postgresCatalog = (PostgresCatalog) JdbcCatalogUtils.createCatalog(
    catalogName,
    defaultDatabase,
    username,
    pwd,
    baseUrl);
tableEnv.registerCatalog(postgresCatalog.getName(), postgresCatalog);
tableEnv.useCatalog(postgresCatalog.getName());
Table dbtable = tableEnv.sqlQuery("select cast(uuid as varchar) from `localschema.table`");
tableEnv.createTemporaryView("dbtable", dbtable);
My intention was to simply perform a left join and find the missing records from the dbtable. Something like this -
Table resultTable = tableEnv.sqlQuery("SELECT * FROM s3table LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid where dbtable.uuid is null");
DataStream<Row> resultStream = tableEnv.toDataStream(resultTable);
resultStream.print();
However, it seems that the UUID column type is not supported just yet because I get the following exception.
Caused by: java.lang.UnsupportedOperationException: Doesn't support Postgres type 'uuid' yet
at org.apache.flink.connector.jdbc.dialect.psql.PostgresTypeMapper.mapping(PostgresTypeMapper.java:171)
As an alternative, I tried to read the database table as follows -
TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] {
    BasicTypeInfo.of(String.class)
};
RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);
JdbcInputFormat jdbcInputFormat = JdbcInputFormat.buildJdbcInputFormat()
    .setDrivername("org.postgresql.Driver")
    .setDBUrl("jdbc:postgresql://127.0.0.1:5432/localdatabase")
    .setQuery("select cast(uuid as varchar) from localschema.table")
    .setUsername("postgres")
    .setPassword("postgres")
    .setRowTypeInfo(rowTypeInfo)
    .finish();
DataStream<Row> dbStream = env.createInput(jdbcInputFormat);
Table dbtable = tableEnv.fromDataStream(dbStream).as("uuid");
tableEnv.createTemporaryView("dbtable", dbtable);
Only this time, I get the following exception on performing the left join (as above) -
Exception in thread "main" org.apache.flink.table.api.TableException: Table sink '*anonymous_datastream_sink$3*' doesn't support consuming update and delete changes which is produced by node Join(joinType=[LeftOuterJoin]
It works if I tweak the resultStream to use a changelog stream -
Table resultTable = tableEnv.sqlQuery("SELECT * FROM s3table LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid where dbtable.uuid is null");
DataStream<Row> resultStream = tableEnv.toChangelogStream(resultTable);
resultStream.print();
Sample O/P
+I[9cc38226-bcce-47ce-befc-3576195a0933, null]
+I[a24bf933-1bb7-425f-b1a7-588fb175fa11, null]
+I[da6f57c8-3ad1-4df5-9636-c6b36df2695f, null]
+I[2f3845c1-6444-44b6-b1e8-c694eee63403, null]
-D[9cc38226-bcce-47ce-befc-3576195a0933, null]
-D[a24bf933-1bb7-425f-b1a7-588fb175fa11, null]
However, I do not want the sink to receive the inserts and deletes separately; I want just the final list of missing uuids. I guess this happens because my Postgres source, created with DataStream<Row> dbStream = env.createInput(jdbcInputFormat);, is a streaming source. If I try to execute the whole application in BATCH mode, I get the following exception -
org.apache.flink.table.api.ValidationException: Querying an unbounded table '*anonymous_datastream_source$2*' in batch mode is not allowed. The table source is unbounded.
Is it possible to have a bounded JDBC source? If not, how can I achieve this using the streaming API? (I'm using Flink version 1.15.2.)
I believe this kind of case would be a common use case that can be implemented with Flink, but clearly I'm missing something. Any leads would be appreciated.
For now, the common approach would be to sink the resultStream to a table: schedule a job that truncates the table, then runs the Apache Flink job, and then read the results from that table.
I also noticed that Apache Flink Table Store 0.3.0 has just been released, and materialized views are on its roadmap for 0.4.0. That might be a solution too. Very exciting, imho.
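A minimal sketch of the "sink the result to a table" part of that approach, assuming the Flink JDBC connector is on the classpath; the sink table name, URL and credentials below are placeholders, not taken from the question:

// Register an upsert JDBC sink; the primary key lets the sink consume the update/delete
// changes produced by the left join, so the Postgres table ends up holding only the final rows.
tableEnv.executeSql(
    "CREATE TABLE missing_uuids_sink (" +
    "  uuid STRING," +
    "  PRIMARY KEY (uuid) NOT ENFORCED" +
    ") WITH (" +
    "  'connector' = 'jdbc'," +
    "  'url' = 'jdbc:postgresql://127.0.0.1:5432/localdatabase'," +
    "  'table-name' = 'localschema.missing_uuids'," +
    "  'username' = 'postgres'," +
    "  'password' = 'postgres'" +
    ")");

// Write the join result into the sink instead of printing the changelog.
tableEnv.executeSql(
    "INSERT INTO missing_uuids_sink " +
    "SELECT s3table.uuid FROM s3table " +
    "LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid " +
    "WHERE dbtable.uuid IS NULL");

Reading localschema.missing_uuids afterwards gives the final list, which is essentially the truncate-and-reload pattern described above.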

Flink SQL : UDTF passes Row type parameters

CREATE TABLE user_log (
    data ROW(id STRING, user_id STRING, class_id STRING)
) WITH (
    'connector.type' = 'kafka',
    ...
);
INSERT INTO sink
SELECT * FROM user_log as tab,
LATERAL TABLE(splitUdtf(tab.data)) AS T(a,b,c);
UDTF Code:
public void eval(Row data) {...}
Can the eval method only accept Row-typed parameters? I want to access the Row fields by key in SQL, such as id, user_id, class_id, but in Java the Row fields are accessed by index (such as 0, 1, 2). How do I do it? Thank you!
Is your SQL able to convert the Kafka data directly into a table Row? Probably not.
Row is a type at the DataStream level, not a type in the Table API & SQL.
If the data you receive from Kafka is in JSON format, you can use a DDL statement in Flink SQL, or the connector API, to extract the fields from the JSON directly, as long as your JSON is in key-value format.
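For example, a sketch of that DDL approach, assuming the Kafka message is flat JSON with the keys id, user_id and class_id (the connector options below are placeholders for whatever the original table uses):

// Declare the JSON keys as top-level columns so they can be referenced by name in SQL.
tableEnv.executeSql(
    "CREATE TABLE user_log (" +
    "  id STRING," +
    "  user_id STRING," +
    "  class_id STRING" +
    ") WITH (" +
    "  'connector' = 'kafka'," +
    "  'topic' = 'user_log'," +
    "  'properties.bootstrap.servers' = 'localhost:9092'," +
    "  'format' = 'json'" +
    ")");

// The UDTF can then take plain STRING parameters instead of a Row, e.g.
//   public void eval(String id, String userId, String classId) {...}
// and be called as:
//   SELECT T.a, T.b, T.c FROM user_log, LATERAL TABLE(splitUdtf(id, user_id, class_id)) AS T(a, b, c)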

How to convert DataStream[Row] into Table?

For this case, the Java environment supports the returns() method to specify type information.
But in the Scala environment, org.apache.flink.streaming.api.scala.DataStream doesn't have returns().
How do I convert a Scala DataStream[Row] into a Table?
According to the Table API documentation, you can use:
// get a TableEnvironment
val tableEnv: StreamTableEnvironment = ... // see "Create a TableEnvironment" section
// DataStream of Row with two fields "name" and "age" specified in `RowTypeInfo`
val stream: DataStream[Row] = ...
// convert DataStream into Table with default field names "name", "age"
val table: Table = tableEnv.fromDataStream(stream)

postgres insert string to numeric column - auto-typecast does not happen

There is a table test1 in a Postgres DB with the following schema:
We are using Spring Framework's JdbcTemplate to insert data as below:
Object[] params = {"978","tour"};
jdbcTemplate.update("insert into test1 values (?,?)", params);
But this gives the exception:
org.springframework.jdbc.BadSqlGrammarException: PreparedStatementCallback; bad SQL grammar [insert into test1 values (?,?)]; nested exception is org.postgresql.util.PSQLException: ERROR: column "id" is of type integer but expression is of type character varying
ERROR: column "id" is of type integer but expression is of type character varying
This works for an Oracle database through implicit type conversion, but Postgres does not seem to work that way.
Could this be an issue with the Postgres driver?
A workaround would be to cast explicitly:
insert into test1 values (?::numeric ,?)
But is there a better way to do the conversion? This does not seem like a good solution, since there are a lot of queries to modify and there can be other such casting issues too.
Is there some parameter that can be set at the DB level to perform an auto cast?
We found the answer here:
Storing json, jsonb, hstore, xml, enum, ipaddr, etc fails with "column "x" is of type json but expression is of type character varying"
A new connection property should be added:
String url = "jdbc:postgresql://localhost/test";
Properties props = new Properties();
props.setProperty("user","fred");
props.setProperty("password","secret");
props.setProperty("stringtype", "unspecified");
Connection conn = DriverManager.getConnection(url, props);
https://jdbc.postgresql.org/documentation/94/connect.html
"If stringtype is set to unspecified, parameters will be sent to the server as untyped values, and the server will attempt to infer an appropriate type. This is useful if you have an existing application that uses setString() to set parameters that are actually some other type, such as integers, and you are unable to change the application to use an appropriate method such as setInt()"
Yeah, drop the double quotes here:
Object[] params = {"978","tour"};
Becomes
Object[] params = {978,"tour"};
Alternatively do the casting as you mentioned.

JOOQ fails with PostgreSQL Custom Type as an Array: ERROR: malformed record literal

I have the following custom type on Postgres:
CREATE TYPE my_custom_type AS (
    field_a VARCHAR,
    field_b NUMERIC(10,3)
);
and the following table:
CREATE TABLE my_table
(
    COL1 VARCHAR(120) NOT NULL,
    CUSTOM_COLUMN my_custom_type,
    CUSTOM_COLUMN_ARRAY my_custom_type[]
);
Everything works fine when I use my custom type with JOOQ:
@Test
public void testWithoutArray() {
    MyTableRecord record = dsl.newRecord(MyTable.MY_TABLE);
    record.setCol1("My Col1");
    MyCustomType customType = new MyCustomType();
    customType.setFieldA("Field A Val");
    customType.setFieldB(BigDecimal.ONE);
    record.setCustomColumn(customType);
    record.store();
}
However, when I try to set some value in the field mapped to the custom type array, I get the following error:
@Test
public void testWithArray() {
    MyTableRecord record = dsl.newRecord(MyTable.MY_TABLE);
    record.setCol1("My Col1");
    MyCustomTypeRecord customType = new MyCustomTypeRecord();
    customType.setFieldA("Field A Val 1");
    customType.setFieldB(BigDecimal.ONE);
    MyCustomTypeRecord customType2 = new MyCustomTypeRecord();
    customType2.setFieldA("Field A Val 2");
    customType2.setFieldB(BigDecimal.TEN);
    record.setCustomColumnArray(new MyCustomTypeRecord[]{customType, customType2});
    record.store();
}
org.jooq.exception.DataAccessException: SQL [insert into "my_table" ("col1", "custom_column_array") values (?, ?::my_custom_type[]) returning "my_table"."col1"]; ERROR: malformed record literal: "my_custom_type"(Field A Val 1, 1)"
Detail: Missing left parenthesis.
at org.jooq.impl.Utils.translate(Utils.java:1553)
at org.jooq.impl.DefaultExecuteContext.sqlException(DefaultExecuteContext.java:571)
at org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:347)
at org.jooq.impl.TableRecordImpl.storeInsert0(TableRecordImpl.java:176)
at org.jooq.impl.TableRecordImpl$1.operate(TableRecordImpl.java:142)
at org.jooq.impl.RecordDelegate.operate(RecordDelegate.java:123)
at org.jooq.impl.TableRecordImpl.storeInsert(TableRecordImpl.java:137)
at org.jooq.impl.UpdatableRecordImpl.store0(UpdatableRecordImpl.java:185)
at org.jooq.impl.UpdatableRecordImpl.access$000(UpdatableRecordImpl.java:85)
at org.jooq.impl.UpdatableRecordImpl$1.operate(UpdatableRecordImpl.java:135)
at org.jooq.impl.RecordDelegate.operate(RecordDelegate.java:123)
at org.jooq.impl.UpdatableRecordImpl.store(UpdatableRecordImpl.java:130)
at org.jooq.impl.UpdatableRecordImpl.store(UpdatableRecordImpl.java:123)
The query shown in the jOOQ debug log is the following:
DEBUG [main] org.jooq.tools.LoggerListener#debug:255 - Executing query : insert into "my_table" ("col1", "custom_column_array") values (?, ?::my_custom_type[]) returning "my_table"."col1"
DEBUG [main] org.jooq.tools.LoggerListener#debug:255 - -> with bind values : insert into "my_table" ("col1", "custom_column_array") values ('My Col1', array[[UDT], [UDT]]) returning "my_table"."col1"
Am I missing some configuration or is it a bug?
Cheers
As stated in the relevant issue (https://github.com/jOOQ/jOOQ/issues/4162), this is a missing piece of support for this kind of PostgreSQL functionality. The answer given in the issue so far is:
Unfortunately, this is an area where we have to work around a couple of limitations of the PostgreSQL JDBC driver, which doesn't implement SQLData and other API (see also pgjdbc/pgjdbc#63).
Currently, jOOQ binds arrays and UDTs as strings. It seems that this particular combination is not yet supported. You will probably be able to work around this limitation by implementing your own custom data type Binding:
http://www.jooq.org/doc/latest/manual/code-generation/custom-data-type-bindings/
