I am using the Flink Table API to pull data from a Kinesis stream into a table. I want to periodically pull that data into a temporary table and run a custom scalar function on it. However, I notice that my scalar function is not being called at all.
Here is the code for the Kinesis table:
this.tableEnv.executeSql("CREATE TABLE transactions (\n" +
" entry STRING,\n" +
" sequence_number VARCHAR(128) NOT NULL METADATA FROM 'sequence-number' VIRTUAL,\n" +
" shard_id VARCHAR(128) NOT NULL METADATA FROM 'shard-id' VIRTUAL,\n" +
" arrival_time TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL,\n" +
" WATERMARK FOR arrival_time AS arrival_time - INTERVAL '5' SECOND\n" +
") WITH (\n" +
" 'connector' = 'kinesis',\n" +
" 'stream' = '" + streamName + "',\n" +
" 'aws.region' = 'us-west-2', \n" +
" 'format' = 'raw'\n" +
")");
Then, I want to run a tumbling window every second that pulls data from Kinesis and updates a temporary table.
My temporary table is defined like this:
this.tableEnv.executeSql("CREATE TABLE temporaryTable (\n" +
" entry STRING,\n" +
" sequence_number VARCHAR(128) NOT NULL,\n" +
" shard_id VARCHAR(128) NOT NULL,\n" +
" arrival_time TIMESTAMP(3),\n" +
" record_list STRING NOT NULL,\n" +
" PRIMARY KEY (shard_id, sequence_number) NOT ENFORCED" +
") WITH (\n" +
" 'connector' = 'print'\n" +
")");
I then have the code to do the tumbling:
Table inMemoryTable = transactions
    .window(Tumble.over(lit(1).second()).on($("arrival_time")).as("log_ts"))
    .groupBy($("entry"), $("sequence_number"), $("log_ts"), $("shard_id"), $("arrival_time"))
    .select(
        $("entry"),
        $("sequence_number"), $("shard_id"), $("arrival_time"),
        call(CustomFunction.class, $("entry")).as("record_list"));

inMemoryTable.executeInsert("temporaryTable");
The CustomFunction class looks like this:
public class CustomFunction extends ScalarFunction {

    @DataTypeHint("STRING")
    public String eval(
            @DataTypeHint("STRING") String serializedEntry) throws IOException {
        return "asd";
    }
}
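Not strictly required for the inline call(CustomFunction.class, ...) usage above, but here is a sketch of registering the function under a name so it can also be invoked from SQL (the name "RecordListFn" is just an example, not from the original code):
// Register the scalar function under an arbitrary name.
this.tableEnv.createTemporarySystemFunction("RecordListFn", CustomFunction.class);

// It can then be called by name from the Table API ...
transactions.select(call("RecordListFn", $("entry")).as("record_list"));

// ... or from SQL.
this.tableEnv.executeSql("SELECT RecordListFn(entry) AS record_list FROM transactions");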
When I run the job in Flink, I don't get anything in stdout, so obviously I am missing something.
Thanks for any help.
I am able to get the stream to print with:
driver.tableEnv.getConfig().getConfiguration().setString("table.exec.source.idle-timeout", "10000 ms");
driver.env.getConfig().setAutoWatermarkInterval(5000);
I'm new to Flink and hope someone can help. I have tried to follow Flink tutorials.
We have a requirement where we consume from two sources:
1. A Kafka topic. When an event arrives on the Kafka topic, we need the JSON event fields (mobile_acc_id, member_id, mobile_number, valid_from, valid_to) to be stored in an external DB (a Postgres DB).
2. A Kinesis stream. When an event arrives on the Kinesis stream, we need to look up the mobile_number from the event in the Postgres DB (from step 1), extract the member_id, enrich the incoming Kinesis event, and sink it to another output stream (see the sketch after this list).
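For reference, step 2 can also be expressed as a lookup join in Flink SQL; a minimal sketch, assuming a kinesis_events table that exposes a processing-time column (proc_time AS PROCTIME()), an enriched_events sink, and the JDBC-backed mobile_account table defined further down:
tableEnv.executeSql(
    "INSERT INTO enriched_events " +
    "SELECT e.*, m.member_id " +
    "FROM kinesis_events AS e " +
    "JOIN mobile_account FOR SYSTEM_TIME AS OF e.proc_time AS m " +
    "ON e.mobile_number = m.mobile_number");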
So I set up a Stream and a Table environment like this:
public static StreamExecutionEnvironment initEnv() {
var env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setAutoWatermarkInterval(0L); // disables periodic watermark generation
return env;
}
public static TableEnvironment initTableEnv() {
var settings = EnvironmentSettings.newInstance().inStreamingMode().build();
return TableEnvironment.create(settings);
}
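As an aside, here is a minimal sketch (my assumption, not part of the original setup) of creating the table environment from the same StreamExecutionEnvironment via the flink-table-api-java-bridge module, so the Table API is bound to that environment rather than a separate one:
public static StreamTableEnvironment initStreamTableEnv(StreamExecutionEnvironment env) {
    // Bridge variant: the table environment wraps the given stream environment.
    return StreamTableEnvironment.create(env,
            EnvironmentSettings.newInstance().inStreamingMode().build());
}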
Calling the process(..) method with initEnv() uses Kinesis as the source:
process(config.getTransformerConfig(), input, sink, deadLetterSink, initEnv());
In process(..) I am also initialising the Table Environment using initTableEnv(), hoping that Flink will consume from both sources when I call env.execute(..):
public static void process(TransformerConfig cfg, SourceFunction<String> source, SinkFunction<UsageSummaryWithHeader> sink,
SinkFunction<DeadLetterEvent> deadLetterSink, StreamExecutionEnvironment env) throws Exception {
var events =
StreamUtils.source(source, env, "kinesis-events", cfg.getInputParallelism());
collectInSink(transform(cfg, events, deadLetterSink), sink, "kinesis-summary-events", cfg.getOutputParallelism());
processStreamIntoTable(initTableEnv());
env.execute("my-flink-event-enricher-svc");
}
private static void processStreamIntoTable(TableEnvironment tableEnv) throws Exception {
tableEnv.executeSql("CREATE TABLE mobile_accounts (\n" +
" mobile_acc_id VARCHAR(36) NOT NULL,\n" +
" member_id BIGINT NOT NULL,\n" +
" mobile_number VARCHAR(14) NOT NULL,\n" +
" valid_from TIMESTAMP NOT NULL,\n" +
" valid_to TIMESTAMP NOT NULL \n" +
") WITH (\n" +
" 'connector' = 'kafka',\n" +
" 'topic' = 'mobile_accounts',\n" +
" 'properties.bootstrap.servers' = 'kafka:9092',\n" +
" 'format' = 'json'\n" +
")");
tableEnv.executeSql("CREATE TABLE mobile_account\n" +
"(\n" +
" mobile_acc_id VARCHAR(36) NOT NULL,\n" +
" member_id BIGINT NOT NULL,\n" +
" mobile_number VARCHAR(14) NOT NULL,\n" +
" valid_from TIMESTAMP NOT NULL,\n" +
" valid_to TIMESTAMP NOT NULL \n" +
") WITH (\n" +
" 'connector' = 'jdbc',\n" +
" 'url' = 'jdbc:postgresql://flinkpg:5432/flink-demo',\n" +
" 'table-name' = 'mobile_account',\n" +
" 'driver' = 'org.postgresql.Driver',\n" +
" 'username' = 'flink-demo',\n" +
" 'password' = 'flink-demo'\n" +
")");
Table mobileAccounts = tableEnv.from("mobile_accounts");
report(mobileAccounts).executeInsert("mobile_account");
}
public static Table report(Table mobileAccounts) {
return mobileAccounts.select(
$("mobile_acc_id"),
$("member_id"),
$("mobile_number"),
$("valid_from"),
$("valid_to"));
}
What I have noticed in the Flink console is that it is only consuming from one source.
I liked TableEnvironment as not much code is needed to get the items inserted into the DB.
How can we consume from both sources, the Kinesis stream and the TableEnvironment, in Flink?
Am I using the right approach?
Is there an alternative to implement my requirements?
I assume you are able to create the tables correctly; then you can simply JOIN the two streams, named kafka_stream and kinesis_stream, as:
SELECT l.*, r.member_id FROM kinesis_stream as l
INNER JOIN kafka_stream as r
ON l.mobile_number = r.mobile_number;
If the PostgreSQL sink is essential, you can handle it in a separate query:
INSERT INTO postgre_sink
SELECT * FROM kafka_stream;
These two queries should solve your problem with the Table API (or Flink SQL).
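A minimal sketch of how the Kinesis side could be declared as a table in the same TableEnvironment so the join above runs entirely in Flink SQL (field names, stream name, region, and format are assumptions based on the question):
tableEnv.executeSql("CREATE TABLE kinesis_stream (\n" +
        "  mobile_number VARCHAR(14),\n" +
        "  event_payload STRING\n" +
        ") WITH (\n" +
        "  'connector' = 'kinesis',\n" +
        "  'stream' = 'my-kinesis-stream',\n" +
        "  'aws.region' = 'us-west-2',\n" +
        "  'format' = 'json'\n" +
        ")");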
I am using Flink 1.14, deployed with the Lyft Flink operator.
I am trying to build a tumbling window aggregate with the Table API: read from the transactions table source and write the per-window aggregate result into a new Kafka topic.
My source is a Kafka topic populated by Debezium.
EnvironmentSettings settings = EnvironmentSettings.inStreamingMode();
TableEnvironment tEnv = TableEnvironment.create(settings);
//this is the source
tEnv.executeSql("CREATE TABLE transactions (\n" +
" event_time TIMESTAMP(3) METADATA FROM 'value.source.timestamp' VIRTUAL,\n"+
" transaction_time AS TO_TIMESTAMP_LTZ(4001, 3),\n"+
" id INT PRIMARY KEY,\n" +
" transaction_status STRING,\n" +
" transaction_type STRING,\n" +
" merchant_id INT,\n" +
" WATERMARK FOR transaction_time AS transaction_time - INTERVAL '5' SECOND\n" +
") WITH (\n" +
" 'debezium-json.schema-include' = 'true' ,\n" +
" 'connector' = 'kafka',\n" +
" 'topic' = 'dbserver1.inventory.transactions',\n" +
" 'properties.bootstrap.servers' = 'my-cluster-kafka-bootstrap.kafka.svc:9092',\n" +
" 'properties.group.id' = 'testGroup',\n" +
" 'scan.startup.mode' = 'earliest-offset',\n"+
" 'format' = 'debezium-json'\n" +
")");
I do the tumbling window and count the ids in each window with:
public static Table report(Table transactions) {
return transactions
.window(Tumble.over(lit(2).minutes()).on($("transaction_time")).as("w"))
.groupBy($("w"), $("transaction_status"))
.select(
$("w").start().as("window_start"),
$("w").end().as("window_end"),
$("transaction_status"),
$("id").count().as("id_count"));
}
The sink is:
tEnv.executeSql("CREATE TABLE my_report (\n" +
"window_start TIMESTAMP(3),\n"+
"window_end TIMESTAMP(3)\n,"+
"transaction_status STRING,\n" +
" id_count BIGINT,\n" +
" PRIMARY KEY (window_start) NOT ENFORCED\n"+
") WITH (\n" +
" 'connector' = 'upsert-kafka',\n" +
" 'topic' = 'dbserver1.inventory.my-window-sink',\n" +
" 'properties.bootstrap.servers' = 'my-cluster-kafka-bootstrap.kafka.svc:9092',\n" +
" 'properties.group.id' = 'testGroup',\n" +
" 'key.format' = 'json',\n"+
" 'value.format' = 'json'\n"+
")");
Table transactions = tEnv.from("transactions");
Table merchants = tEnv.from("merchants");
report(transactions).executeInsert("my_report");
The problem is that when I consume dbserver1.inventory.my-window-sink with
kubectl -n kafka exec my-cluster-kafka-0 -c kafka -i -t -- bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic dbserver1.inventory.my-window-sink --from-beginning
I don't get any results. I wait 2 minutes (the window size), insert into the transactions table, then wait again for 2 minutes and insert again; still no results.
I don't know if I have a problem with my watermark.
I am working with parallelism 2.
On the Flink dashboard UI, in the details of the GroupWindowAggregate task, I can see that Records Received increases when I insert into the table, but I still can't see any results when I consume the topic.
With this line
transaction_time AS TO_TIMESTAMP_LTZ(4001, 3)
you have given every event the same transaction time (4001), and with
WATERMARK FOR transaction_time AS transaction_time - INTERVAL '5' SECOND
you have arranged for the watermarks to depend on the transaction_time. With this arrangement, time is standing still, and the windows can never close.
As for "I wait 2 minutes (the window size)," this isn't how event time processing works. Assuming the timestamps and watermarks were actually moving forward, you would need to wait however long it takes to process 2 minutes worth of data.
In addition to what David helpfully answered, I was missing table.exec.source.idle-timeout in the configuration of the streaming environment, an option that controls when a source is considered idle.
The default value is 0, which means sources are never marked idle. I set it to 1000 ms and that fixed it: the idle-source condition is detected and watermarks are generated properly.
This probably won't affect regular streams with consistent message ingestion, but it mattered in my case because I was inserting records manually, so the stream was idle much of the time.
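For reference, a sketch of how the option can be set on the table environment (the 1000 ms value mirrors what worked for me):
// Treat a source as idle after 1 second without data, so watermarks can still advance.
tEnv.getConfig().getConfiguration()
        .setString("table.exec.source.idle-timeout", "1000 ms");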
I am using Flink 1.11 and trying a nested query with MATCH_RECOGNIZE inside, as shown below:
select * from events where id = (
  SELECT * FROM events
  MATCH_RECOGNIZE (
    PARTITION BY org_id
    ORDER BY proctime
    MEASURES A.id AS startId
    ONE ROW PER MATCH
    PATTERN (A C* B)
    DEFINE
      A AS A.tag = 'tag1',
      C AS C.tag <> 'tag2',
      B AS B.tag = 'tag2'
  )
);
I am getting this error: org.apache.calcite.sql.validate.SqlValidatorException: Table 'A' not found
Is this not supported? If not, what's the alternative?
I was able to get something working by doing this:
Table events = tableEnv.fromDataStream(input,
$("sensorId"),
$("ts").rowtime(),
$("kwh"));
tableEnv.createTemporaryView("events", events);
Table matches = tableEnv.sqlQuery(
"SELECT id " +
"FROM events " +
"MATCH_RECOGNIZE ( " +
"PARTITION BY sensorId " +
"ORDER BY ts " +
"MEASURES " +
"this_step.sensorId AS id " +
"AFTER MATCH SKIP TO NEXT ROW " +
"PATTERN (this_step next_step) " +
"DEFINE " +
"this_step AS TRUE, " +
"next_step AS TRUE " +
")"
);
tableEnv.createTemporaryView("mmm", matches);
Table results = tableEnv.sqlQuery(
"SELECT * FROM events WHERE events.sensorId IN (select * from mmm)");
tableEnv
.toAppendStream(results, Row.class)
.print();
For some reason, I couldn't get it to work without defining a view. I kept getting Calcite errors.
I guess you are trying to avoid enumerating all of the columns from A in the MEASURES clause of the MATCH_RECOGNIZE. You may want to compare the resulting execution plans to see if there's any significant difference.
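One way to do that comparison is to print the plans; a sketch, assuming the tables and views created above (Table.explain() and TableEnvironment.explainSql() are available in Flink 1.11):
// Print the execution plan of the view-based query for inspection.
System.out.println(results.explain());

// Or equivalently via SQL:
System.out.println(tableEnv.explainSql(
        "SELECT * FROM events WHERE events.sensorId IN (select * from mmm)"));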
I have a Java application that for some installations accesses a PostgreSQL database, while in others it accesses essentially the same database in Derby.
I have a SQL query that returns an examination record from the examination table. There is an exam_procedure table that relates to the examination table in a one (examination) to many fashion. I need to concatenate the potentially multiple string records in the exam_procedure table so that I can add a single string value to the query return that represents all the related exam_procedure records. For a variety of reasons (e.g., joins return too many records, especially when multiple subqueries are needed for other related one-to-many tables), I need to do this via a subquery in the SELECT section of the main query. The following SQL works just fine for PostgreSQL, but my understanding is that array_agg is not available in Derby. What Derby subquery can I substitute for the PostgreSQL subquery?
Many thanks.
// part of the query
"SELECT "
+ "patient_id, "
+ "examination_date, "
+ "examination_number, "
+ "operating_physician_id, "
+ "referring_physician_id, "
+ "patient.last_name AS pt_last_name, "
+ "patient.first_name AS pt_first_name, "
+ "patient.middle_name AS pt_middle_name, "
+ "("
+ "SELECT "
+ "array_agg(prose) "
+ "FROM "
+ "exam_procedure "
+ "WHERE examination_id = " + examId
+ " GROUP BY examination_id"
+ ") AS agg_procedures, "
+ "FROM "
+ "examination "
+ "JOIN patient ON patient.id = examination.patient_id "
+ "WHERE "
+ "examination.id = ?"
;
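As far as I know, Derby has no built-in string aggregation comparable to array_agg or LISTAGG, so one fallback is to run the subquery separately and concatenate in Java; a minimal JDBC sketch (the delimiter and the assumption of an open java.sql.Connection are mine):
// Returns the prose values for one examination, joined into a single string.
static String aggProcedures(java.sql.Connection connection, int examId) throws java.sql.SQLException {
    String sql = "SELECT prose FROM exam_procedure WHERE examination_id = ?";
    StringBuilder agg = new StringBuilder();
    try (java.sql.PreparedStatement ps = connection.prepareStatement(sql)) {
        ps.setInt(1, examId);
        try (java.sql.ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                if (agg.length() > 0) {
                    agg.append(", ");  // delimiter is arbitrary
                }
                agg.append(rs.getString("prose"));
            }
        }
    }
    return agg.toString();
}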
I am creating an SQLite database.
db.execSQL("CREATE TABLE " + DATABASE_TABLE + " ("
+ KEY_ROWID + " INTEGER PRIMARY KEY AUTOINCREMENT, "
+ KEY_NAME + " TEXT NOT NULL, "
+ KEY_WORKED + " INTEGER, "
+ KEY_NOTE + " INTEGER);");
Is it possible to set the default value of KEY_NOTE (which is an integer) for every row created to be 0 (zero)? If so, what should be the correct code?
Use the SQLite keyword default
db.execSQL("CREATE TABLE " + DATABASE_TABLE + " ("
+ KEY_ROWID + " INTEGER PRIMARY KEY AUTOINCREMENT, "
+ KEY_NAME + " TEXT NOT NULL, "
+ KEY_WORKED + " INTEGER, "
+ KEY_NOTE + " INTEGER DEFAULT 0);");
This link is useful: http://www.sqlite.org/lang_createtable.html
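A quick usage check with the same Android SQLiteDatabase API used above (a sketch; the column values are made up): rows inserted without a value for KEY_NOTE come back as 0.
// Insert a row without setting KEY_NOTE; the DEFAULT 0 applies.
ContentValues values = new ContentValues();
values.put(KEY_NAME, "Alice");
values.put(KEY_WORKED, 8);
long rowId = db.insert(DATABASE_TABLE, null, values);

// Reading it back, KEY_NOTE is 0 rather than NULL.
Cursor c = db.query(DATABASE_TABLE, new String[] { KEY_NOTE },
        KEY_ROWID + " = ?", new String[] { String.valueOf(rowId) },
        null, null, null);
if (c.moveToFirst()) {
    int note = c.getInt(c.getColumnIndexOrThrow(KEY_NOTE)); // 0
}
c.close();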
A column with default value:
CREATE TABLE <TableName>(
...
<ColumnName> <Type> DEFAULT <DefaultValue>
...
)
<DefaultValue> is a placeholder for either a literal value or a parenthesized ( expression ).
Examples:
Count INTEGER DEFAULT 0,
LastSeen TEXT DEFAULT (datetime('now'))
I'm just starting to learn coding, and I needed something similar to what you asked, in SQLite (I'm using SQLiteStudio 3.1.1).
It turns out you must define the column as NOT NULL and then add the DEFAULT constraint with your desired value, or it will not work (I don't know whether this is an SQLite requirement or a requirement of the program).
Here is the code I used:
CREATE TABLE <MY_TABLE> (
    <MY_TABLE_KEY>    INTEGER UNIQUE PRIMARY KEY,
    <MY_TABLE_SERIAL> TEXT DEFAULT (<MY_VALUE>) NOT NULL,
    <THE_REST_COLUMNS>
);