I'm trying to use Flink to consume the change event log produced by Debezium. The JSON looks like this:
{
  "schema": {
  },
  "payload": {
    "before": null,
    "after": {
      "team_config_id": 3800,
      "team_config_team_id": "team22bcb26e-499a-41e6-8746-b7d980e79e04",
      "team_config_sfdc_account_id": null,
      "team_config_sfdc_account_url": null,
      "team_config_business_type": 5,
      "team_config_dpsa_status": 0,
      "team_config_desc": null,
      "team_config_company_id": null,
      "team_config_hm_count_stages": null,
      "team_config_assign_credits_times": null,
      "team_config_real_renew_date": null,
      "team_config_action_date": null,
      "team_config_last_action_date": null,
      "team_config_business_tier_notification": "{}",
      "team_config_create_date": 1670724933000,
      "team_config_update_date": 1670724933000,
      "team_config_rediscovery_tier": 0,
      "team_config_rediscovery_tier_notification": "{}",
      "team_config_sfdc_industry": null,
      "team_config_sfdc_market_segment": null,
      "team_config_unterminated_note_id": 0
    },
    "source": {
    },
    "op": "c",
    "ts_ms": 1670724933149,
    "transaction": null
  }
}
And I've tried two ways to declare the input schema.
The first way was to parse the JSON data directly:
create table team_config_source (
  `payload` ROW<
    `after` ROW<
      ...
      team_config_create_date TIMESTAMP(3),
      team_config_update_date TIMESTAMP(3),
      ...
    >
  >
) WITH (
  'connector' = 'kafka',
  ...
  'format' = 'json'
)
But Flink throws an error, org.apache.flink.formats.json.JsonToRowDataConverters$JsonParseException: Fail to deserialize at field: team_config_create_date, caused by java.time.format.DateTimeParseException: Text '1670724933000' could not be parsed at index 0. Doesn't Flink support timestamps in this format?
I've also tried another way, using the built-in debezium format:
create table team_config_source (
team_config_create_id int,
...
) WITH (
'connector' = 'kafka',
...
'format' = 'debezium-json'
)
But Flink comes up with another error, java.io.IOException: Corrupt Debezium JSON message, caused by java.lang.NullPointerException. I found someone saying that an update event shouldn't have null as its before value, but this message is a create event.
Could anyone help check my DDL?
I am not a Flink expert, but TIMESTAMP in Flink is not epoch time; it expects a date-time format.
In this case you can define the table like this:
team_config_create_bigint BIGINT,
team_config_update_bigint BIGINT,
...
-- FROM_UNIXTIME expects seconds, while the payload carries epoch milliseconds,
-- so divide by 1000 before converting
team_config_create_date as TO_TIMESTAMP(FROM_UNIXTIME(team_config_create_bigint / 1000)),
team_config_update_date as TO_TIMESTAMP(FROM_UNIXTIME(team_config_update_bigint / 1000))
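Putting this together for the first approach ('format' = 'json'), here is a minimal sketch of what the DDL could look like, assuming the epoch values are milliseconds as in the sample payload; only two payload fields are shown, and the create_ts/update_ts column names are just illustrative:
create table team_config_source (
  `payload` ROW<
    `after` ROW<
      team_config_id INT,
      team_config_create_date BIGINT,  -- epoch milliseconds, read as BIGINT
      team_config_update_date BIGINT
    >
  >,
  -- computed columns expose the epoch values as TIMESTAMP columns
  -- (second precision, since FROM_UNIXTIME works on seconds)
  create_ts as TO_TIMESTAMP(FROM_UNIXTIME(`payload`.`after`.team_config_create_date / 1000)),
  update_ts as TO_TIMESTAMP(FROM_UNIXTIME(`payload`.`after`.team_config_update_date / 1000))
) WITH (
  'connector' = 'kafka',
  -- topic and bootstrap-server options as in the original DDL
  'format' = 'json'
)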
In the end-to-end Flink SQL tutorial, the source table is defined as a Kafka-backed table with a timestamp column on which watermarking is enabled:
CREATE TABLE user_behavior (
user_id BIGINT,
item_id BIGINT,
category_id BIGINT,
behavior STRING,
ts TIMESTAMP(3),
proctime AS PROCTIME(), -- generates processing-time attribute using computed column
WATERMARK FOR ts AS ts - INTERVAL '5' SECOND -- defines watermark on ts column, marks ts as event-time attribute
) WITH (
'connector' = 'kafka', -- using kafka connector
'topic' = 'user_behavior', -- kafka topic
'scan.startup.mode' = 'earliest-offset', -- reading from the beginning
'properties.bootstrap.servers' = 'kafka:9094', -- kafka broker address
'format' = 'json' -- the data format is json
);
As long as the GROUP BY uses a TUMBLE over the ts field, this seems natural (Flink knows when to trigger/evict the windows), but in the middle of the tutorial we see the following expression:
INSERT INTO cumulative_uv
SELECT date_str, MAX(time_str), COUNT(DISTINCT user_id) as uv
FROM (
SELECT
DATE_FORMAT(ts, 'yyyy-MM-dd') as date_str,
SUBSTR(DATE_FORMAT(ts, 'HH:mm'),1,4) || '0' as time_str,
user_id
FROM user_behavior)
GROUP BY date_str;
Here the GROUP BY is on the derived date_str field, but how does watermarking work here? How does Flink decide when to "close" a date_str bucket? Since date_str is a function of ts, Flink would somehow have to understand how a watermark update for ts translates into a water level for the date_str field, which seems unfeasible to me. How does this work internally? Does Flink store all encountered records in its state?
Perhaps you can refer to the link below to learn about how watermarks are generated and propagated, especially the section "How Operators Process Watermarks".
In this example the watermark is generated from ts in the source operator; downstream operators only process the watermark that is passed along, which has nothing to do with the date_str field.
public class TimestampsAndWatermarksOperator<T> extends AbstractStreamOperator<T>
        implements OneInputStreamOperator<T, T>, ProcessingTimeCallback {

    // ...

    @Override
    public void open() throws Exception {
        super.open();

        timestampAssigner = watermarkStrategy.createTimestampAssigner(this::getMetricGroup);
        watermarkGenerator =
                emitProgressiveWatermarks
                        ? watermarkStrategy.createWatermarkGenerator(this::getMetricGroup)
                        : new NoWatermarksGenerator<>();

        wmOutput = new WatermarkEmitter(output);

        watermarkInterval = getExecutionConfig().getAutoWatermarkInterval();
        if (watermarkInterval > 0 && emitProgressiveWatermarks) {
            final long now = getProcessingTimeService().getCurrentProcessingTime();
            getProcessingTimeService().registerTimer(now + watermarkInterval, this);
        }
    }

    @Override
    public void processElement(final StreamRecord<T> element) throws Exception {
        final T event = element.getValue();
        final long previousTimestamp =
                element.hasTimestamp() ? element.getTimestamp() : Long.MIN_VALUE;
        final long newTimestamp = timestampAssigner.extractTimestamp(event, previousTimestamp);

        element.setTimestamp(newTimestamp);
        output.collect(element);
        watermarkGenerator.onEvent(event, newTimestamp, wmOutput);
    }

    // ...

    @Override
    public void processWatermark(org.apache.flink.streaming.api.watermark.Watermark mark)
            throws Exception {
        // if we receive a Long.MAX_VALUE watermark we forward it since it is used
        // to signal the end of input and to not block watermark progress downstream
        if (mark.getTimestamp() == Long.MAX_VALUE) {
            wmOutput.emitWatermark(Watermark.MAX_WATERMARK);
        }
    }

    // ...
}
https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/event-time/generating_watermarks/
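For contrast, here is a rough sketch (not from the tutorial) of a variant whose output is driven directly by the watermark: grouping by a TUMBLE window over ts means each window, and hence each emitted row, is finalized once the watermark passes the window end. Note the semantics differ from the cumulative query above, since this counts distinct users per 10-minute window rather than cumulatively per day.
-- Sketch only: each 10-minute window is closed by the watermark on ts.
SELECT
  TUMBLE_START(ts, INTERVAL '10' MINUTE) AS window_start,
  COUNT(DISTINCT user_id) AS uv
FROM user_behavior
GROUP BY TUMBLE(ts, INTERVAL '10' MINUTE);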
I see examples that convert a Flink Table object to a DataStream and run StreamExecutionEnvironment.execute.
How would I code and run a continuous query that writes to a streaming sink with the Table API, without converting to a DataStream?
It seems this must be possible, because otherwise what would be the purpose of specifying streaming-sink table connectors?
The Table API docs describe continuous queries and dynamic tables, yet most of the actual Java APIs and code examples seem to use the Table API only for batch.
EDIT: To show David Anderson what I'm trying to do, here are the three Flink SQL CREATE TABLE statements on top of analogous Derby SQL tables.
I see that the JDBC table connector sink supports streaming, but am I not configuring it correctly? I don't see anything that I'm overlooking.
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/jdbc.html
FYI, when I get my toy example working, I am planning on using Kafka in production for input/output stream-like data and JDBC/SQL for the lookup table.
CREATE TABLE LookupTableFlink (
`lookup_key` STRING NOT NULL,
`lookup_value` STRING NOT NULL,
PRIMARY KEY (lookup_key) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:derby:memory:myDB;create=false',
'table-name' = 'LookupTable'
);
CREATE TABLE IncomingEventsFlink (
`field_to_use_as_lookup_key` STRING NOT NULL,
`extra_field` INTEGER NOT NULL,
`proctime` AS PROCTIME()
) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:derby:memory:myDB;create=false',
'table-name' = 'IncomingEvents'
);
CREATE TABLE TransformedEventsFlink (
`field_to_use_as_lookup_key` STRING,
`extra_field` INTEGER,
`lookup_key` STRING,
`lookup_value` STRING
) WITH (
'connector' = 'jdbc',
'url' = 'jdbc:derby:memory:myDB;create=false',
'table-name' = 'TransformedEvents'
);
String sqlQuery =
"SELECT\n" +
" IncomingEventsFlink.field_to_use_as_lookup_key, IncomingEventsFlink.extra_field,\n" +
" LookupTableFlink.lookup_key, LookupTableFlink.lookup_value\n" +
"FROM IncomingEventsFlink\n" +
"LEFT JOIN LookupTableFlink FOR SYSTEM_TIME AS OF IncomingEventsFlink.proctime\n" +
"ON (IncomingEventsFlink.field_to_use_as_lookup_key = LookupTableFlink.lookup_key)\n";
Table joinQuery = tableEnv.sqlQuery(sqlQuery);
// This seems to run, return, and complete, rather than running in continuous/streaming mode.
TableResult tableResult = joinQuery.executeInsert("TransformedEventsFlink");
You can write to a dynamic table by using executeInsert, as in
Table orders = tableEnv.from("Orders");
orders.executeInsert("OutOrders");
The documentation is here.
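If you want to stay entirely in SQL, the same continuous insert can also be submitted as a single statement through TableEnvironment.executeSql; a minimal sketch using the same Orders/OutOrders tables:
-- Sketch only: submitted with tableEnv.executeSql(...), this runs the same
-- insert-into job without any DataStream conversion.
INSERT INTO OutOrders
SELECT * FROM Orders;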
It's explained here.
A code example can be found here:
// get StreamTableEnvironment.
StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// Table with two fields (String name, Integer age)
Table table = ...
// convert the Table into an append DataStream of Row by specifying the class
DataStream<Row> dsRow = tableEnv.toAppendStream(table, Row.class);
// convert the Table into an append DataStream of Tuple2<String, Integer>
// via a TypeInformation
TupleTypeInfo<Tuple2<String, Integer>> tupleType = new TupleTypeInfo<>(
Types.STRING(),
Types.INT());
DataStream<Tuple2<String, Integer>> dsTuple =
tableEnv.toAppendStream(table, tupleType);
// convert the Table into a retract DataStream of Row.
// A retract stream of type X is a DataStream<Tuple2<Boolean, X>>.
// The boolean field indicates the type of the change.
// True is INSERT, false is DELETE.
DataStream<Tuple2<Boolean, Row>> retractStream =
tableEnv.toRetractStream(table, Row.class);
I'm trying to extract a few nested fields in PyFlink from JSON data received from Kafka. The JSON record schema is as follows. Basically, each record has a Result object within which there's an array of objects called data. I'm trying to extract the value field from the first array element i.e. data[0].
{
'ID': 'some-id',
'Result': {
'data': [
{
'value': 65537,
...
...
}
]
}
}
I'm using the Table API to read from a Kafka topic and write the extracted fields to another topic.
The source DDL is as follows:
source_ddl = """
CREATE TABLE InTable (
`ID` STRING,
`Timestamp` TIMESTAMP(3),
`Result` ROW(
`data` ROW(`value` BIGINT) ARRAY),
WATERMARK FOR `Timestamp` AS `Timestamp`
) WITH (
'connector' = 'kafka',
'topic' = 'in-topic',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'my-group-id',
'scan.startup.mode' = 'earliest-offset',
'format' = 'json'
)
"""
The corresponding sink DDL is:
sink_ddl = """
CREATE TABLE OutTable (
`ID` STRING,
`value` BIGINT
) WITH (
'connector' = 'kafka',
'topic' = 'out-topic',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'my-group-id',
'scan.startup.mode' = 'earliest-offset',
'format' = 'json'
)
"""
Here's the code snippet for extracting the value field from the first element of the array:
t_env.execute_sql(source_ddl)
t_env.execute_sql(sink_ddl)
table = t_env.from_path('InTable')
table \
.select(
table.ID,
table.Result.data.at(1).value) \
.execute_insert('OutTable') \
.wait()
When I execute this, I see the following error in the execute_insert step.
py4j.protocol.Py4JJavaError: An error occurred while calling o57.executeInsert.
: scala.MatchError: ITEM($9.data, 1) (of class org.apache.calcite.rex.RexCall)
However, if I don't extract the embedded value but instead take the entire array element, i.e. table.Result.data.at(1), and modify the sink_ddl appropriately, I get the entire row properly.
Any idea what I am missing? Thanks for any pointers!
Edit: This is probably a bug in Flink, and it is being tracked by https://issues.apache.org/jira/browse/FLINK-22082.
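In the meantime, the working variant described above (selecting the whole array element with table.Result.data.at(1)) needs the sink schema to declare that element as a ROW rather than a BIGINT; a minimal sketch of such a sink DDL, with hypothetical column and topic names:
CREATE TABLE OutTableRow (
  `ID` STRING,
  `element` ROW(`value` BIGINT)
) WITH (
  'connector' = 'kafka',
  'topic' = 'out-topic-row',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
)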
I'm using the Flink FileSystem SQL connector to read events from Kafka and write to S3 (using MinIO). Here is my code:
exec_env = StreamExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
# start a checkpoint every 10 s
exec_env.enable_checkpointing(10000)
exec_env.set_state_backend(FsStateBackend("s3://test-bucket/checkpoints/"))
t_config = TableConfig()
t_env = StreamTableEnvironment.create(exec_env, t_config)
INPUT_TABLE = "source"
INPUT_TOPIC = "Rides"
LOCAL_KAFKA = 'kafka:9092'
OUTPUT_TABLE = "sink"
ddl_source = f"""
CREATE TABLE {INPUT_TABLE} (
`rideId` BIGINT,
`isStart` BOOLEAN,
`eventTime` STRING,
`lon` FLOAT,
`lat` FLOAT,
`psgCnt` INTEGER,
`taxiId` BIGINT
) WITH (
'connector' = 'kafka',
'topic' = '{INPUT_TOPIC}',
'properties.bootstrap.servers' = '{LOCAL_KAFKA}',
'format' = 'json'
)
"""
ddl_sink = f"""
CREATE TABLE {OUTPUT_TABLE} (
`rideId` BIGINT,
`isStart` BOOLEAN,
`eventTime` STRING,
`lon` FLOAT,
`lat` FLOAT,
`psgCnt` INTEGER,
`taxiId` BIGINT
) WITH (
'connector' = 'filesystem',
'path' = 's3://test-bucket/kafka_output',
'format' = 'parquet'
)
"""
t_env.sql_update(ddl_source)
t_env.sql_update(ddl_sink)
t_env.execute_sql(f"""
INSERT INTO {OUTPUT_TABLE}
SELECT *
FROM {INPUT_TABLE}
""")
I'm using Flink 1.11.3 and flink-s3-fs-hadoop 1.11.3. I have copied the flink-s3-fs-hadoop-1.11.3.jar into the plugins folder.
cp /opt/flink/lib/flink-s3-fs-hadoop-1.11.3.jar /opt/flink/plugins/s3-fs-hadoop/;
I have also added the following configuration to flink-conf.yaml:
state.backend: filesystem
state.checkpoints.dir: s3://test-bucket/checkpoints/
s3.endpoint: http://127.0.0.1:9000
s3.path.style.access: true
s3.access-key: minio
s3.secret-key: minio123
MinIO is running properly and I have created the 'test-bucket' in MinIO. When I run this job, the job submission doesn't happen and the Flink Dashboard goes into some sort of waiting state. After 15-20 minutes I get the following exception:
pyflink.util.exceptions.TableException: Failed to execute sql
What seems to be the problem here?
See this example:
create table conv (
  SM ROW(objectType STRING, verb STRING, actor ROW(orgId STRING), `object` ROW(contentCategory STRING, links ARRAY<ROW(ecmType STRING)>)),
  `timestamp` STRING  -- TIMESTAMP is a reserved keyword, so it needs backticks
)
WITH (
  'connector' = 'kafka',  -- using kafka connector
  'topic' = 'sj1_spark_conv_hdfs'  -- trailing comma removed; the WITH list must not end with one
)
How do I use explode on SM.object.links.ecmType instead of indexing with ecmType[1]?
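Flink SQL has no explode() function; arrays are expanded with CROSS JOIN UNNEST. A rough sketch against the conv table above (the aliases c and l, and the choice to also select objectType, are just for illustration):
-- Each element of SM.object.links becomes its own row, so ecmType can be
-- selected directly instead of being indexed with ecmType[1].
SELECT c.SM.objectType, l.ecmType
FROM conv AS c
CROSS JOIN UNNEST(c.SM.`object`.links) AS l (ecmType);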