I have a Flink job that creates a Dynamic table from a database changelog stream. The table definition looks as follows:
tableEnv.sqlUpdate("""
CREATE TABLE some_table_name (
id INTEGER,
name STRING,
created_at BIGINT,
updated_at BIGINT
)
WITH (
'connector' = 'kafka',
'topic' = 'topic',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.zookeeper.connect' = 'localhost:2181',
'properties.group.id' = 'group_1',
'format' = 'debezium-json',
'debezium-json.schema-include' = 'true'
)
""")
When trying to reference that table in another running Flink application on the same cluster, my program returns an error: SqlValidatorException: Object 'some_table_name' not found. Is it possible to register that table somehow such that other programs can use it? For example in a statement like this:
tableEnv.sqlQuery("""
SELECT count(*) FROM some_table_name
""").execute().print()
Note that a table in Flink doesn't hold any data. Another Flink application can independently create its own table backed by the same Kafka topic, for example. So not sharing tables between applications isn't as tragic as you might expect.
But you can share tables by storing them in an external catalog. E.g., you could use an Apache Hive catalog for this purpose. See the docs for more info.
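As a rough sketch of the catalog approach, the statements below are what one application could run to store the table definition in a Hive catalog. They are wrapped in Python only to keep them in one place. The catalog name `shared_hive` and the path `/opt/hive-conf` are placeholders, not values from the question, and the Hive connector dependencies are assumed to be on the classpath.

```python
from typing import List

# Placeholder catalog name and hive-conf-dir; adjust to your environment.
CREATE_CATALOG = """
CREATE CATALOG shared_hive WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/opt/hive-conf'
)
""".strip()

USE_CATALOG = "USE CATALOG shared_hive"


def catalog_statements(create_table_ddl: str) -> List[str]:
    """Return the statements to run, in order, to persist a table definition."""
    return [CREATE_CATALOG, USE_CATALOG, create_table_ddl.strip()]
```

Each statement would be passed to `executeSql` (`execute_sql` in PyFlink) in turn. A second application would then run only the first two statements before querying `some_table_name`.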
I am trying to use Flink SQL to read data from a Kafka topic. We have a pattern where, if the payload size is greater than 1 MB, we upload the payload to S3 and send the S3 location in the Kafka event instead.
I have a flink table like this
CREATE TABLE table_name
(
header VARCHAR,
contentJson varchar,
`timestamp` timestamp_ltz
) WITH (
'connector' = 'kafka',
'topic' = 'topic-name',
'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
'json.timestamp-format.standard' = 'ISO-8601'
);
Here the contentJson field can be actual json like
{
"stack": "stuff"
}
or it could be a string like /some-bucket/some-folder/actual-file.json
How do I use
insert into final_table
select
JSON_VALUE(header, '$.some-path-json') as value-1,
JSON_VALUE(contentJson, '$.some-path-json') as value-2 -- this works if the `contentJson` is actual json and not a point to s3 bucket.
from table_name
The question is: can I do all of this with Flink SQL, or should I convert the table to a stream and process each message there, calling the AWS S3 API to fetch the data when contentJson is an S3 location?
The right solution to this kind of problem would be a lookup join; unfortunately, that's not supported by the filesystem connector.
I think the best solution right now would be to do it with the DataStream API.
I have written this example to show a possible solution, but I would advise against using it for the following reasons:
the contents of both tables are kept in Flink's state forever by default
it requires the file to have already been read into Flink's state (you can set the interval for reading files with source.monitor-interval)
the file you want to read must be under the path you specify for the filesystem table
CREATE TABLE kafka_source (
  header STRING, -- added so the SELECT below can reference it
  json_or_file STRING
) WITH (
  'connector' = 'kafka',
  -- ...
);

CREATE TABLE file_source (
  content STRING,
  `file.path` STRING NOT NULL METADATA
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/test',
  'format' = 'raw'
);

SELECT
  kafka_source.header,
  CASE WHEN file_source.content IS NULL THEN
    JSON_VALUE(kafka_source.json_or_file, '$.key')
  ELSE
    JSON_VALUE(file_source.content, '$.key')
  END
FROM
  kafka_source
  FULL OUTER JOIN file_source ON file_source.`file.path` = kafka_source.json_or_file
WHERE
  JSON_VALUE(kafka_source.json_or_file, '$.key') IS NOT NULL
  OR file_source.content IS NOT NULL;
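For the DataStream route suggested above, the per-record decision could look roughly like the following. This is a plain-Python sketch, not Flink code. The `fetch` callable is a hypothetical stand-in for whatever S3 client call you use, and the assumption that a pointer has the form `/bucket/key` is taken from the example path in the question.

```python
import json
from typing import Callable


def resolve_content(content_json: str, fetch: Callable[[str, str], str]) -> dict:
    """Return the payload as a dict, fetching it when the field is only a pointer.

    `fetch(bucket, key)` is a hypothetical helper that returns the file body as text.
    """
    try:
        parsed = json.loads(content_json)
        if isinstance(parsed, dict):
            return parsed  # the field already holds the inline JSON payload
    except ValueError:
        pass  # not valid JSON, so treat it as a location

    # Assumed pointer format: "/some-bucket/some-folder/actual-file.json"
    bucket, _, key = content_json.lstrip("/").partition("/")
    return json.loads(fetch(bucket, key))
```

In a job, a function like this would be called from a map or async I/O operator, with `fetch` wrapping the S3 client, before the result is handed back to the Table API.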
In my pipeline I am using PyFlink to load and transform data from an RDS instance and sink it to MySQL. Using Flink CDC I am able to get the data I want from the RDS, and with the JDBC library I sink it to MySQL. My aim is to read 1 table and create 10 others in 1 job, using a sample of the code below (basically breaking a huge table into smaller tables). The problem I am facing is that, despite using RocksDB as the state backend and Flink CDC options such as scan.incremental.snapshot.chunk.size, scan.snapshot.fetch.size and debezium.min.row.count.to.stream.result, memory usage keeps growing, causing a TaskManager with 2 GB of memory to fail. My intuition is that a simple select-insert query loads the whole table into memory no matter what. If so, can I somehow avoid that? The table size is around 500k rows.
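from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment
env = StreamExecutionEnvironment.get_execution_environment()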
t_env = StreamTableEnvironment.create(env)
stmt_set = t_env.create_statement_set()
create_kafka_source= (
"""
CREATE TABLE somethin(
bla INT,
bla1 DOUBLE,
bla2 TIMESTAMP(3),
PRIMARY KEY(bla2) NOT ENFORCED
) WITH (
'connector'='mysql-cdc',
'server-id'='1000',
'debezium.snapshot.mode' = 'when_needed',
'debezium.poll.interval.ms'='5000',
'hostname'= 'som2',
'port' ='som2',
'database-name'='som3',
'username'='som4',
'password'='somepass',
'table-name' = 'atable'
)
"""
)
create_kafka_dest = (
"""CREATE TABLE IF NOT EXISTS atable(
time1 TIMESTAMP(3),
blah2 DOUBLE,
PRIMARY KEY(time1) NOT ENFORCED
) WITH ( 'connector'= 'jdbc',
'url' = 'jdbc:mysql://name1:3306/name1',
'table-name' = 't1','username' = 'user123',
'password' = '123'
)"""
)
t_env.execute_sql(create_kafka_source)
t_env.execute_sql(create_kafka_dest)
stmt_set.add_insert_sql(
"INSERT INTO atable SELECT DISTINCT bla2, bla1 "
"FROM somethin"
)
Using DISTINCT in a streaming query is expensive, especially when there aren't any temporal constraints on the distinctiveness (e.g., counting unique visitors per day). I imagine that's why your query needs a lot of state.
However, you should be able to get this to work. RocksDB isn't always well-behaved; sometimes it will consume more memory than it has been allocated.
What version of Flink are you using? Improvements were made in Flink 1.11 (by switching to jemalloc) and further improvements came in Flink 1.14 (by upgrading to a newer version of RocksDB). So upgrading Flink might fix this. Otherwise you may need to basically lie and tell Flink it has somewhat less memory than it actually has, so that when RocksDB steps out of bounds it doesn't cause out-of-memory errors.
I have the following simple code snippet that writes the result of a streaming GROUP BY into a Kafka topic.
The Kafka sink table definition:
CREATE TABLE sinkTable (
id STRING,
total_price DOUBLE
) WITH (
'connector' = 'kafka',
'topic' = 'test6',
'properties.bootstrap.servers' = 'localhost:9092',
'key.format' = 'json',
'key.json.ignore-parse-errors' = 'true',
'key.fields' = 'id',
'value.format' = 'json',
'value.json.fail-on-missing-field' = 'false',
'value.fields-include' = 'ALL'
)
When I run the following query
insert into sinkTable
select id, sum(price)
from sourceTable
group by id
it throws the following exception:
Table sink 'default_catalog.default_database.sinkTable' doesn't support consuming update changes which is produced by node GroupAggregate(groupBy=[id], select=[id, SUM(price) AS EXPR$1])
I don't know where the problem is. It looks to me like the kafka connector doesn't support GROUP BY queries?
The problem is exactly as you've described it: the default kafka connector only supports append-only streams. And as you may imagine, the query you are trying to run will produce an update for every new element, since it changes the sum for the elements with that id.
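To make the update behaviour concrete, here is a small plain-Python simulation (not Flink code) of the changelog a running `SUM ... GROUP BY id` emits. The row-kind labels `+I`, `-U` and `+U` follow Flink's insert / update-before / update-after naming; the function name and sample rows are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Change = Tuple[str, str, float]


def running_sum_changelog(rows: Iterable[Tuple[str, float]]) -> List[Change]:
    """Emit the changes a grouped running sum produces for each incoming row."""
    totals: Dict[str, float] = {}
    changes: List[Change] = []
    for key, price in rows:
        if key in totals:
            # An existing group changes: retract the old sum, then emit the new one.
            changes.append(("-U", key, totals[key]))
            totals[key] += price
            changes.append(("+U", key, totals[key]))
        else:
            # First row for this key: a plain insert.
            totals[key] = price
            changes.append(("+I", key, price))
    return changes


print(running_sum_changelog([("a", 1.0), ("b", 2.0), ("a", 3.0)]))
# [('+I', 'a', 1.0), ('+I', 'b', 2.0), ('-U', 'a', 1.0), ('+U', 'a', 4.0)]
```

An append-only sink can only accept the `+I` entries, which is why the planner rejects the query.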
One of the simplest things to do is to use the upsert-kafka connector, which will automatically handle updates and write them to Kafka. It is only available since Flink 1.12, so you may want to upgrade to that version if you haven't yet.
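A possible upsert-kafka version of the sink table from the question is sketched below, held in a Python string so it can be passed to `execute_sql`. It assumes `id` is the key: to my knowledge the connector requires a primary key and derives the Kafka message key from it, so `key.fields` is dropped. Please check the options against the documentation for your Flink version.

```python
# Sketch of the sink DDL using upsert-kafka; topic and servers are copied from the question.
UPSERT_SINK_DDL = """
CREATE TABLE sinkTable (
  id STRING,
  total_price DOUBLE,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'test6',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'value.format' = 'json'
)
""".strip()
```

With this table registered, the `INSERT INTO sinkTable SELECT id, SUM(price) ... GROUP BY id` query should be accepted, and each update is written as the latest value for its key.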
I'm working on registering a DataStream as a Flink Table in a StreamTableEnvironment.
The syntax I'm using is like so:
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
tableEnv.createTemporaryView(
"foo",
dataStream,
$("f0").as("foo-id"),
$("f1").as("foo-value")
);
In this case "foo-id" and "foo-value" are primitive types. In the DataStream I also have JSON objects that I would ideally like to register as a Row type, similar to the suggestion here for the Flink CREATE TABLE command:
CREATE TABLE input(
id VARCHAR,
title VARCHAR,
properties ROW(`foo` VARCHAR)
) WITH (
'connector' = 'kafka-0.11',
'topic' = 'my-topic',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'python-test',
'format' = 'json'
);
See: Get nested fields from Kafka message using Apache Flink SQL
Is there a similar way to define nested types using Expressions?
I'm not registering streams using the Create Table command with connector because the data format is custom, which is why I resorted to registering a stream.
I played around with return types and nesting some more and this solution seems to work:
DataStream<Row> testDataStream = env.fromCollection(Arrays.asList(
Row.of("sherin", 1L, Row.of(100L, 200L)),
Row.of("thomas", 1L, Row.of(100L, 200L))
)).returns(
Types.ROW(
Types.STRING,
Types.LONG,
Types.ROW_NAMED(
new String[]{"val1", "val2"},
Types.LONG,
Types.LONG)
));
tableEnv.createTemporaryView(
"foo",
testDataStream,
$("f0").as("name"),
$("f1").as("c"),
$("f2").as("age"));
Table testTable = tableEnv.sqlQuery(
"select name, c, age.val1, age.val2 from foo"
);
DataStream<Row> result = tableEnv.toAppendStream(
testTable,
TypeInformation.of(Row.class)
);
result.print().setParallelism(2);
I'm still open to ideas, if there are better ways of doing this.
I am using Flink 1.12.0 with the following code to work with batch data.
I would like to show the content of the table. When I call print, it complains that the table can't be converted to a DataSet, but I don't want to use BatchTableEnvironment, which belongs to the old planner API.
test("batch test") {
val settings = EnvironmentSettings.newInstance().inBatchMode().useBlinkPlanner().build()
val tenv = TableEnvironment.create(settings)
val ddl =
"""
create table sourceTable(
key STRING,
`date` STRING,
price DOUBLE
) with (
'connector' = 'filesystem',
'path' = 'D:/projects/openprojects3/learn.flink.ioc/data/stock.csv',
'format' = 'csv'
)
""".stripMargin(' ')
tenv.executeSql(ddl)
val table = tenv.sqlQuery(
"""
select * from sourceTable
""".stripMargin(' '))
table.print()
}
To my knowledge, a processing-time temporal table join requires that the lookup table is backed by a LookupTableSource. So far, in the Flink code base itself, only the Hive, HBase and JDBC connectors implement this interface. If you want to quickly try this out, you can also use [1], which also implements said interface.
[1] https://github.com/knaufk/flink-faker