Does the Kafka connector only support append-only streams - apache-flink

I have the following simple code snippet that is meant to write a streaming group-by result into a Kafka topic.
The Kafka sink table definition:
CREATE TABLE sinkTable (
  id STRING,
  total_price DOUBLE
) WITH (
  'connector' = 'kafka',
  'topic' = 'test6',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'key.json.ignore-parse-errors' = 'true',
  'key.fields' = 'id',
  'value.format' = 'json',
  'value.json.fail-on-missing-field' = 'false',
  'value.fields-include' = 'ALL'
)
When I run the following query
insert into sinkTable
select id, sum(price)
from sourceTable
group by id
it throws an exception; the exception is:
Table sink 'default_catalog.default_database.sinkTable' doesn't support consuming update changes which is produced by node GroupAggregate(groupBy=[id], select=[id, SUM(price) AS EXPR$1])
I don't know where the problem is. It looks to me like the kafka connector doesn't support a group-by query?

The problem is exactly as you've described it: the default Kafka connector only supports append-only streams. As you may imagine, the query you are trying to run will produce an update for every new element, since each element changes the sum for its id.
One of the simplest things to do is to use the upsert-kafka connector, which will automatically handle updates and write them to Kafka. It is only available since Flink 1.12, though, so you may want to upgrade to that version if you haven't yet.
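For reference, a minimal sketch of what the upsert-kafka version of the sink could look like, reusing the topic and formats from the question (the PRIMARY KEY declaration is required by the connector and determines the Kafka message key):
CREATE TABLE sinkTable (
  id STRING,
  total_price DOUBLE,
  PRIMARY KEY (id) NOT ENFORCED    -- required by upsert-kafka; these columns become the message key
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'test6',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',           -- serializes the primary-key columns
  'value.format' = 'json'          -- serializes the full row as the message value
)
With such a sink, the original INSERT INTO sinkTable ... GROUP BY id statement can be used unchanged; updates are written as upserts on the id key.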

Related

Flink SQL API - how to read a Kafka event which in turn has a location in S3

I am trying to use Flink SQL to read data from a Kafka topic. We have a pattern where, if the payload size is greater than 1 MB, we upload the payload to S3 and send an S3 location in the Kafka event instead.
I have a Flink table like this:
CREATE TABLE table_name
(
  header VARCHAR,
  contentJson VARCHAR,
  `timestamp` TIMESTAMP_LTZ
) WITH (
  'connector' = 'kafka',
  'topic' = 'topic-name',
  'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
  'format' = 'json',  -- without a format option the DDL is incomplete and the json.* option below has no effect
  'json.timestamp-format.standard' = 'ISO-8601'
);
Here the contentJson field can be actual JSON like
{
  "stack": "stuff"
}
or it could be a string like /some-bucket/some-folder/actual-file.json
How do I use
insert into final_table
select
  JSON_VALUE(header, '$.some-path-json') as `value-1`,
  JSON_VALUE(contentJson, '$.some-path-json') as `value-2` -- this works if `contentJson` is actual JSON and not a pointer to an S3 bucket
from table_name
The question is: can I do all this with Flink SQL, or should I convert the table to a stream and process the messages there, where I can call the AWS S3 API to get the data if contentJson is an S3 location?
The right solution to these kinds of problems would be to use a lookup join; unfortunately, that's not supported for the filesystem connector.
I think the best solution right now would be to do it with the DataStream API.
I have written this example to show a possible solution, but I would advise against using it for the following reasons:
both tables' contents will be kept in Flink's state forever by default (see the sketch after the query below)
it requires the file to have already been read by Flink into its state (you can set the interval at which files are read using source.monitor-interval)
the file you want to read must be in the path you specify for the filesystem table
CREATE TABLE kafka_source (
  header STRING,        -- kept from the question's schema; referenced in the SELECT below
  json_or_file STRING
) WITH (
  'connector' = 'kafka',
  -- ...
)
CREATE TABLE file_source (
  content STRING,
  `file.path` STRING NOT NULL METADATA
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/test',
  'format' = 'raw'
)
SELECT
  header,
  CASE WHEN file_source.content IS NULL THEN
    JSON_VALUE(kafka_source.json_or_file, '$.key')
  ELSE
    JSON_VALUE(file_source.content, '$.key')
  END
FROM
  kafka_source
  FULL OUTER JOIN file_source ON file_source.`file.path` = kafka_source.json_or_file
WHERE
  JSON_VALUE(kafka_source.json_or_file, '$.key') IS NOT NULL
  OR file_source.content IS NOT NULL
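As a partial mitigation for the first caveat above (state growing forever), the idle-state retention time can be configured. A rough sketch, assuming Flink 1.13 or later where the table.exec.state.ttl option and this SET syntax are available:
SET 'table.exec.state.ttl' = '1 h';
Note that expiring state can cause rows to stop matching in the join, so the TTL has to be chosen with that trade-off in mind.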

Flink SQL Tumble Aggregation result not written out to filesystem locally

Context
I have a Flink job coded with the Python SQL API. It consumes source data from Kinesis and produces results to Kinesis. I want to run a local test to ensure the Flink application code is correct, so I mocked out both the source Kinesis and the sink Kinesis with the filesystem connector and then ran the test pipeline locally. Although the local Flink job always runs successfully, when I look into the sink file it is always empty. This has also been the case when I run the code in the Flink SQL Client.
Here is my code:
CREATE TABLE incoming_data (
  requestId VARCHAR(4),
  groupId VARCHAR(32),
  userId VARCHAR(32),
  requestStartTime VARCHAR(32),
  processTime AS PROCTIME(),
  requestTime AS TO_TIMESTAMP(SUBSTR(REPLACE(requestStartTime, 'T', ' '), 0, 23), 'yyyy-MM-dd HH:mm:ss.SSS'),
  WATERMARK FOR requestTime AS requestTime - INTERVAL '5' SECOND
) WITH (
  'connector' = 'filesystem',
  'path' = '/path/to/test/json/file.json',
  'format' = 'json',
  'json.timestamp-format.standard' = 'ISO-8601'
)
CREATE TABLE user_latest_request (
  groupId VARCHAR(32),
  userId VARCHAR(32),
  latestRequestTime TIMESTAMP
) WITH (
  'connector' = 'filesystem',
  'path' = '/path/to/sink',
  'format' = 'csv'
)
INSERT INTO user_latest_request
SELECT groupId,
       userId,
       MAX(requestTime) AS latestRequestTime
FROM incoming_data
GROUP BY TUMBLE(processTime, INTERVAL '1' SECOND), groupId, userId;
Curious what I am doing wrong here.
Note:
I am using Flink 1.11.0
If I directly dump data from source to sink without windowing and grouping, it works fine. That means the source and sink tables are set up correctly, so it seems the problem is around the tumbling window and grouping with the local filesystem.
This code works fine with Kinesis source and sink.
Have you enabled checkpointing? This is required if you are in STREAMING mode, which appears to be the case. See https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/file_sink/
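If missing checkpointing is indeed the issue, a minimal sketch of enabling it; the quoted SET syntax assumes a newer SQL Client (Flink 1.13+), while on 1.11 the same key can be set in flink-conf.yaml or via the TableConfig:
SET 'execution.checkpointing.interval' = '10s';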
The most likely cause is that there isn't enough data in the file being read to keep the job running long enough for the window to close. You have a processing-time-based window that is 1 second long, which means that the job will have to run for at least one second to guarantee that the first window will produce results.
Otherwise, once the source runs out of data the job will shut down, regardless of whether the window contains unreported results.
If you switch to event-time-based windowing, then when the file source runs out of data it will send one last watermark with the value MAX_WATERMARK, which will trigger the window.
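To illustrate that last suggestion, here is a sketch of the same insert switched to the event-time attribute that incoming_data already declares; requestTime carries the watermark, so the final MAX_WATERMARK emitted when the file is exhausted will fire the last window:
INSERT INTO user_latest_request
SELECT groupId,
       userId,
       MAX(requestTime) AS latestRequestTime
FROM incoming_data
GROUP BY TUMBLE(requestTime, INTERVAL '1' SECOND), groupId, userId;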

Share dynamic tables between Flink programs

I have a Flink job that creates a Dynamic table from a database changelog stream. The table definition looks as follows:
tableEnv.sqlUpdate("""
  CREATE TABLE some_table_name (
    id INTEGER,
    name STRING,
    created_at BIGINT,
    updated_at BIGINT
  )
  WITH (
    'connector' = 'kafka',
    'topic' = 'topic',
    'properties.bootstrap.servers' = 'localhost:9092',
    'properties.zookeeper.connect' = 'localhost:2181',
    'properties.group.id' = 'group_1',
    'format' = 'debezium-json',
    'debezium-json.schema-include' = 'true'
  )
""")
When trying to reference that table in another running Flink application on the same cluster, my program returns an error: SqlValidatorException: Object 'some_table_name' not found. Is it possible to register that table somehow such that other programs can use it? For example in a statement like this:
tableEnv.sqlQuery("""
SELECT count(*) FROM some_table_name
""").execute().print()
Note that a table in Flink doesn't hold any data. Another Flink application can independently create another table backed by the same Kafka topic, for example. So not sharing tables between applications isn't as tragic as you might expect.
But you can share tables by storing them in an external catalog. E.g., you could use an Apache Hive catalog for this purpose. See the docs for more info.
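A rough sketch of what that could look like in SQL; the catalog name and hive-conf-dir path are made up for the example:
CREATE CATALOG shared_hive WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/opt/hive-conf'  -- hypothetical directory containing hive-site.xml
);
USE CATALOG shared_hive;
Tables created while this catalog is active are persisted in the Hive Metastore, so another Flink application that registers the same catalog can query some_table_name directly.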

How to register Flink table schema with nested fields?

I'm working on registering a DataStream as a Flink table in a StreamTableEnvironment.
The syntax I'm using is as follows:
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
tableEnv.createTemporaryView(
    "foo",
    dataStream,
    $("f0").as("foo-id"),
    $("f1").as("foo-value")
);
In this case "foo-id" and "foo-value" are primitive types. In the DataStream I also have JSON objects that I would ideally like to register as a ROW type, similar to the suggestion here for the Flink CREATE TABLE command:
CREATE TABLE input (
  id VARCHAR,
  title VARCHAR,
  properties ROW(`foo` VARCHAR)
) WITH (
  'connector' = 'kafka-0.11',
  'topic' = 'my-topic',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'python-test',
  'format' = 'json'
);
See: Get nested fields from Kafka message using Apache Flink SQL
Is there a similar way to define nested types using Expressions?
I'm not registering the stream using the CREATE TABLE command with a connector because the data format is custom, which is why I resorted to registering a DataStream directly.
I played around with return types and nesting some more and this solution seems to work:
DataStream<Row> testDataStream = env.fromCollection(Arrays.asList(
    Row.of("sherin", 1L, Row.of(100L, 200L)),
    Row.of("thomas", 1L, Row.of(100L, 200L))
)).returns(
    Types.ROW(
        Types.STRING,
        Types.LONG,
        Types.ROW_NAMED(
            new String[]{"val1", "val2"},
            Types.LONG,
            Types.LONG)
    ));

tableEnv.createTemporaryView(
    "foo",
    testDataStream,
    $("f0").as("name"),
    $("f1").as("c"),
    $("f2").as("age"));

Table testTable = tableEnv.sqlQuery(
    "select name, c, age.val1, age.val2 from foo"
);

DataStream<Row> result = tableEnv.toAppendStream(
    testTable,
    TypeInformation.of(Row.class)
);
result.print().setParallelism(2);
I'm still open to ideas, if there are better ways of doing this.

How to show the table content for the TableEnvironment in batch mode

I am using Flink 1.12.0, and use the following code to work with batch data.
I would like to show the content of the table. When I call print, it complains that the table can't be converted to a DataSet, but I don't want to use BatchTableEnvironment, which is the old planner API.
test("batch test") {
  val settings = EnvironmentSettings.newInstance().inBatchMode().useBlinkPlanner().build()
  val tenv = TableEnvironment.create(settings)
  val ddl =
    """
      create table sourceTable(
        key STRING,
        `date` STRING,
        price DOUBLE
      ) with (
        'connector' = 'filesystem',
        'path' = 'D:/projects/openprojects3/learn.flink.ioc/data/stock.csv',
        'format' = 'csv'
      )
    """.stripMargin(' ')
  tenv.executeSql(ddl)
  val table = tenv.sqlQuery(
    """
      select * from sourceTable
    """.stripMargin(' '))
  table.print()
}
To my knowledge, a processing-time temporal table join requires that the lookup table is backed by a LookupTableSource. So far, only the Hive, HBase and JDBC connectors in the Flink code base itself implement this interface. If you want to quickly try this out, you can also use [1], which also implements said interface.
[1] https://github.com/knaufk/flink-faker
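For context, a processing-time temporal (lookup) join looks roughly like this; orders, dim_table and their columns are hypothetical, and dim_table would have to be backed by one of the lookup-capable connectors mentioned above:
SELECT o.order_id, d.name
FROM orders AS o
  JOIN dim_table FOR SYSTEM_TIME AS OF o.proc_time AS d
    ON o.dim_id = d.id;
Here proc_time is a processing-time attribute declared on orders as proc_time AS PROCTIME().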
