I'm trying to extract a few nested fields in PyFlink from JSON data received from Kafka. The JSON record schema is as follows: each record has a Result object containing an array of objects called data. I'm trying to extract the value field from the first array element, i.e. data[0].
{
  "ID": "some-id",
  "Result": {
    "data": [
      {
        "value": 65537,
        ...
      }
    ]
  }
}
I'm using the Table API to read from a Kafka topic and write the extracted fields to another topic.
The source DDL is as follows:
source_ddl = """
CREATE TABLE InTable (
`ID` STRING,
`Timestamp` TIMESTAMP(3),
`Result` ROW(
`data` ROW(`value` BIGINT) ARRAY),
WATERMARK FOR `Timestamp` AS `Timestamp`
) WITH (
'connector' = 'kafka',
'topic' = 'in-topic',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'my-group-id',
'scan.startup.mode' = 'earliest-offset',
'format' = 'json'
)
"""
The corresponding sink DDL is:
sink_ddl = """
CREATE TABLE OutTable (
`ID` STRING,
`value` BIGINT
) WITH (
'connector' = 'kafka',
'topic' = 'out-topic',
'properties.bootstrap.servers' = 'localhost:9092',
'properties.group.id' = 'my-group-id',
'scan.startup.mode' = 'earliest-offset',
'format' = 'json'
)
"""
Here's the code snippet for extracting the value field from the first element of the array:
t_env.execute_sql(source_ddl)
t_env.execute_sql(sink_ddl)

table = t_env.from_path('InTable')
table \
    .select(
        table.ID,
        table.Result.data.at(1).value) \
    .execute_insert('OutTable') \
    .wait()
When I execute this, I see the following error in the execute_insert step:
py4j.protocol.Py4JJavaError: An error occurred while calling o57.executeInsert.
: scala.MatchError: ITEM($9.data, 1) (of class org.apache.calcite.rex.RexCall)
However, if I don't extract the nested value but instead select the entire first array element, i.e. table.Result.data.at(1), and modify the sink_ddl accordingly, the entire row comes through properly.
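For reference, this is roughly what the working full-row variant looks like (the alias and the adjusted sink schema in the comment are only illustrative):
# Working variant: select the whole first array element instead of the nested field.
# The sink was adjusted accordingly, e.g. (illustrative only):
#   CREATE TABLE OutTable (`ID` STRING, `element` ROW(`value` BIGINT)) WITH (...)
table \
    .select(
        table.ID,
        table.Result.data.at(1).alias('element')) \
    .execute_insert('OutTable') \
    .wait()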
Any idea what I'm missing? Thanks for any pointers!
Edit: This is probably a bug in Flink; it is being tracked at https://issues.apache.org/jira/browse/FLINK-22082.
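Until that issue is fixed, one workaround worth trying (untested here, and it may hit the same planner bug) is to express the array/field access in SQL rather than through the Python expression DSL:
# Untested sketch: the same extraction written as SQL; may or may not avoid FLINK-22082.
t_env.execute_sql("""
    INSERT INTO OutTable
    SELECT `ID`, `Result`.`data`[1].`value` AS `value`
    FROM InTable
""").wait()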
Related
Based on the pyflink walkthrough, I'm now trying to get a simple nested-row query working with apache-flink==1.14.4. I've created my table structure based on this solution: Get nested fields from Kafka message using Apache Flink SQL
A message looks like this:
{"signature": {"token": "abcd1234"}}
The relevant part of the code looks like this:
create_kafka_source_ddl = """
CREATE TABLE nested_msg (
`signature` ROW (
`token` STRING
)
) WITH (
'connector' = 'kafka',
'topic' = 'nested_msg',
'properties.bootstrap.servers' = 'kafka:9092',
'properties.group.id' = 'nested-msg',
'scan.startup.mode' = 'latest-offset',
'format' = 'json'
)
"""
create_es_sink_ddl = """
CREATE TABLE es_sink (
token STRING
) WITH (
'connector' = 'elasticsearch-7',
'hosts' = 'http://elasticsearch:9200',
'index' = 'nested_count_1',
'document-id.key-delimiter' = '$',
'sink.bulk-flush.max-size' = '42mb',
'sink.bulk-flush.max-actions' = '32',
'sink.bulk-flush.interval' = '1000',
'sink.bulk-flush.backoff.delay' = '1000',
'format' = 'json'
)
"""
t_env.execute_sql(create_kafka_source_ddl)
t_env.execute_sql(create_es_sink_ddl)
# How do I select the nested field here?
t_env.from_path("nested_msg").select(col("signature.token").alias("token")).select(
"token"
).execute_insert("es_sink")
I've tried numerous variations here without success. The exception is:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.select.
: org.apache.flink.table.api.ValidationException: Cannot resolve field [signature.token], input field list:[signature].
How can I select a nested field like this in order to insert it into my sink?
You can change col("signature.token") to col("signature").get('token').
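In context, a minimal sketch of the corrected projection (same table names as above):
from pyflink.table.expressions import col

# Take the ROW column, pull out the nested field with .get(), and alias it
# to match the sink schema.
t_env.from_path("nested_msg") \
    .select(col("signature").get("token").alias("token")) \
    .execute_insert("es_sink")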
I want to read multiple directories with the Table API in PyFlink:
from pyflink.table import StreamTableEnvironment
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode

if __name__ == '__main__':
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_runtime_mode(RuntimeExecutionMode.BATCH)
    env.set_parallelism(1)

    table_env = StreamTableEnvironment.create(stream_execution_environment=env)
    table_env \
        .get_config() \
        .get_configuration() \
        .set_string("parallelism.default", "1")

    ddl = """
        CREATE TABLE test (
            a INT,
            b STRING
        ) WITH (
            'connector' = 'filesystem',
            'path' = '{path}',
            'format' = 'csv',
            'csv.ignore-first-line' = 'true',
            'csv.ignore-parse-errors' = 'true',
            'csv.array-element-delimiter' = ';'
        )
    """.format(path='/opt/data/day=2021-11-14,/opt/data/day=2021-11-15,/opt/data/day=2021-11-16')
    table_env.execute_sql(ddl)
But it failed with the following error:
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: File /opt/data/day=2021-11-14,/opt/data/day=2021-11-15,/opt/data/day=2021-11-16 does not exist or the user running Flink ('root') has insufficient permissions to access it.
I'm sure these three directories exist and I have permission to access them:
/opt/data/day=2021-11-14
/opt/data/day=2021-11-15
/opt/data/day=2021-11-16
If multiple directories can't be read in one table, I have to create three tables and union them, which is much more verbose.
Any suggestion is appreciated. Thank you!
Just using
'path' = '/opt/data'
should be sufficient. The filesystem connector is also able to read the partition field and perform filtering based on it. For example, you can define the table with this schema:
CREATE TABLE test (
    a INT,
    b STRING,
    `day` DATE
) PARTITIONED BY (`day`) WITH (
    'connector' = 'filesystem',
    'path' = '/opt/data',
    [...]
)
And then the following query:
SELECT * FROM test WHERE `day` = '2021-11-14'
will read only the files under /opt/data/day=2021-11-14.
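Adapted to the PyFlink snippet above, a sketch might look like this (the partition column is declared as STRING here for simplicity; DATE also works as shown above, and only a subset of the original CSV options is kept):
# Sketch: one table over the root path, with a partition column derived from the day=... directories.
ddl = """
    CREATE TABLE test (
        a INT,
        b STRING,
        `day` STRING
    ) PARTITIONED BY (`day`) WITH (
        'connector' = 'filesystem',
        'path' = '/opt/data',
        'format' = 'csv',
        'csv.ignore-parse-errors' = 'true'
    )
"""
table_env.execute_sql(ddl)

# Partition pruning: this should read only the files under /opt/data/day=2021-11-14
result = table_env.sql_query("SELECT * FROM test WHERE `day` = '2021-11-14'")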
I see examples that convert a Flink Table object to a DataStream and run StreamExecutionEnvironment.execute.
How would I write and run a continuous query that writes to a streaming sink with the Table API, without converting to a DataStream?
It seems this must be possible, because otherwise what is the purpose of specifying streaming sink Table Connectors?
The Table API docs describe continuous queries and dynamic tables, yet most of the actual Java APIs and code examples seem to use the Table API only for batch.
EDIT: To show David Anderson what I'm trying to do, here are the three Flink SQL CREATE TABLE statements on top of analogous Derby SQL tables.
I see the JDBC table connector sink supports streaming, but am I not configuring this correctly? I don't see anything that I'm overlooking.
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/jdbc.html
FYI, when I get my toy example working, I am planning on using Kafka in production for input/output stream-like data and JDBC/SQL for the lookup table.
CREATE TABLE LookupTableFlink (
    `lookup_key` STRING NOT NULL,
    `lookup_value` STRING NOT NULL,
    PRIMARY KEY (lookup_key) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:derby:memory:myDB;create=false',
    'table-name' = 'LookupTable'
)
CREATE TABLE IncomingEventsFlink (
    `field_to_use_as_lookup_key` STRING NOT NULL,
    `extra_field` INTEGER NOT NULL,
    `proctime` AS PROCTIME()
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:derby:memory:myDB;create=false',
    'table-name' = 'IncomingEvents'
)
CREATE TABLE TransformedEventsFlink (
    `field_to_use_as_lookup_key` STRING,
    `extra_field` INTEGER,
    `lookup_key` STRING,
    `lookup_value` STRING
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:derby:memory:myDB;create=false',
    'table-name' = 'TransformedEvents'
)
String sqlQuery =
"SELECT\n" +
" IncomingEventsFlink.field_to_use_as_lookup_key, IncomingEventsFlink.extra_field,\n" +
" LookupTableFlink.lookup_key, LookupTableFlink.lookup_value\n" +
"FROM IncomingEventsFlink\n" +
"LEFT JOIN LookupTableFlink FOR SYSTEM_TIME AS OF IncomingEventsFlink.proctime\n" +
"ON (IncomingEventsFlink.field_to_use_as_lookup_key = LookupTableFlink.lookup_key)\n";
Table joinQuery = tableEnv.sqlQuery(sqlQuery);
// This seems to run, return, and complete, rather than running in continuous/streaming mode.
TableResult tableResult = joinQuery.executeInsert("TransformedEventsFlink");
You can write to a dynamic table by using executeInsert, as in
Table orders = tableEnv.from("Orders");
orders.executeInsert("OutOrders");
The documentation is here.
It's explained here.
A code example can be found here:
// get StreamTableEnvironment.
StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section
// Table with two fields (String name, Integer age)
Table table = ...
// convert the Table into an append DataStream of Row by specifying the class
DataStream<Row> dsRow = tableEnv.toAppendStream(table, Row.class);
// convert the Table into an append DataStream of Tuple2<String, Integer>
// via a TypeInformation
TupleTypeInfo<Tuple2<String, Integer>> tupleType = new TupleTypeInfo<>(
Types.STRING(),
Types.INT());
DataStream<Tuple2<String, Integer>> dsTuple =
tableEnv.toAppendStream(table, tupleType);
// convert the Table into a retract DataStream of Row.
// A retract stream of type X is a DataStream<Tuple2<Boolean, X>>.
// The boolean field indicates the type of the change.
// True is INSERT, false is DELETE.
DataStream<Tuple2<Boolean, Row>> retractStream =
tableEnv.toRetractStream(table, Row.class);
I'm using the Flink FileSystem SQL connector to read events from Kafka and write them to S3 (using MinIO). Here is my code:
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import FsStateBackend  # import paths assumed for Flink 1.11; adjust to your version
from pyflink.table import StreamTableEnvironment, TableConfig

exec_env = StreamExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
# start a checkpoint every 10 s
exec_env.enable_checkpointing(10000)
exec_env.set_state_backend(FsStateBackend("s3://test-bucket/checkpoints/"))
t_config = TableConfig()
t_env = StreamTableEnvironment.create(exec_env, t_config)
INPUT_TABLE = "source"
INPUT_TOPIC = "Rides"
LOCAL_KAFKA = 'kafka:9092'
OUTPUT_TABLE = "sink"
ddl_source = f"""
CREATE TABLE {INPUT_TABLE} (
`rideId` BIGINT,
`isStart` BOOLEAN,
`eventTime` STRING,
`lon` FLOAT,
`lat` FLOAT,
`psgCnt` INTEGER,
`taxiId` BIGINT
) WITH (
'connector' = 'kafka',
'topic' = '{INPUT_TOPIC}',
'properties.bootstrap.servers' = '{LOCAL_KAFKA}',
'format' = 'json'
)
"""
ddl_sink = f"""
CREATE TABLE {OUTPUT_TABLE} (
`rideId` BIGINT,
`isStart` BOOLEAN,
`eventTime` STRING,
`lon` FLOAT,
`lat` FLOAT,
`psgCnt` INTEGER,
`taxiId` BIGINT
) WITH (
'connector' = 'filesystem',
'path' = 's3://test-bucket/kafka_output',
'format' = 'parquet'
)
"""
t_env.sql_update(ddl_source)
t_env.sql_update(ddl_sink)
t_env.execute_sql(f"""
INSERT INTO {OUTPUT_TABLE}
SELECT *
FROM {INPUT_TABLE}
""")
I'm using Flink 1.11.3 and flink-s3-fs-hadoop 1.11.3. I have copied the flink-s3-fs-hadoop-1.11.3.jar into the plugins folder.
cp /opt/flink/lib/flink-s3-fs-hadoop-1.11.3.jar /opt/flink/plugins/s3-fs-hadoop/;
Also I have added the following configs into the flink-conf.yaml.
state.backend: filesystem
state.checkpoints.dir: s3://test-bucket/checkpoints/
s3.endpoint: http://127.0.0.1:9000
s3.path.style.access: true
s3.access-key: minio
s3.secret-key: minio123
MinIO is running properly and I have created the 'test-bucket' bucket in MinIO. When I run this job, the job submission doesn't happen and the Flink Dashboard goes into some sort of waiting state. After 15-20 minutes I get the following exception:
pyflink.util.exceptions.TableException: Failed to execute sql
What seems to be the problem here?
See this example:
CREATE TABLE conv (
    SM ROW(objectType STRING, verb STRING, actor ROW(orgId STRING), object ROW(contentCategory STRING, links ARRAY<ROW(ecmType STRING)>)),
    `timestamp` STRING
) WITH (
    'connector' = 'kafka',  -- using kafka connector
    'topic' = 'sj1_spark_conv_hdfs'
)
How do I use explode on SM.object.ecmType instead of using ecmType[1]?
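I haven't verified this against the exact schema above, but the usual way to explode a nested array in Flink SQL is CROSS JOIN UNNEST; a sketch, assuming the array to expand is SM.object.links:
-- Sketch: produce one output row per element of the nested links array.
SELECT
    c.`timestamp`,
    c.SM.objectType,
    l.ecmType
FROM conv AS c
CROSS JOIN UNNEST(c.SM.`object`.links) AS l (ecmType)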