I'm new to Flink and hope someone can help. I have tried to follow the Flink tutorials.
We have a requirement to consume from two sources:
1. A Kafka topic: when an event arrives, the JSON event fields (mobile_acc_id, member_id, mobile_number, valid_from, valid_to) need to be stored in an external Postgres DB.
2. A Kinesis stream: when an event arrives, we need to look up its mobile_number in the Postgres DB (from step 1), extract the member_id, enrich the incoming Kinesis event with it, and sink the result to another output stream.
So I set up a Stream and a Table environment like this:
public static StreamExecutionEnvironment initEnv() {
var env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setAutoWatermarkInterval(0L); // disables periodic watermark generation
return env;
}
public static TableEnvironment initTableEnv() {
var settings = EnvironmentSettings.newInstance().inStreamingMode().build();
return TableEnvironment.create(settings);
}
Calling the process(..) method with initEnv() uses Kinesis as the source:
process(config.getTransformerConfig(), input, sink, deadLetterSink, initEnv());
In process(..) I am also initialising the Table Environment using initTableEnv(), hoping that Flink will consume from both sources when I call env.execute(..):
public static void process(TransformerConfig cfg, SourceFunction<String> source, SinkFunction<UsageSummaryWithHeader> sink,
SinkFunction<DeadLetterEvent> deadLetterSink, StreamExecutionEnvironment env) throws Exception {
var events =
StreamUtils.source(source, env, "kinesis-events", cfg.getInputParallelism());
collectInSink(transform(cfg, events, deadLetterSink), sink, "kinesis-summary-events", cfg.getOutputParallelism());
processStreamIntoTable(initTableEnv());
env.execute("my-flink-event-enricher-svc");
}
private static void processStreamIntoTable(TableEnvironment tableEnv) throws Exception {
tableEnv.executeSql("CREATE TABLE mobile_accounts (\n" +
" mobile_acc_id VARCHAR(36) NOT NULL,\n" +
" member_id BIGINT NOT NULL,\n" +
" mobile_number VARCHAR(14) NOT NULL,\n" +
" valid_from TIMESTAMP NOT NULL,\n" +
" valid_to TIMESTAMP NOT NULL \n" +
") WITH (\n" +
" 'connector' = 'kafka',\n" +
" 'topic' = 'mobile_accounts',\n" +
" 'properties.bootstrap.servers' = 'kafka:9092',\n" +
" 'format' = 'json'\n" +
")");
tableEnv.executeSql("CREATE TABLE mobile_account\n" +
"(\n" +
" mobile_acc_id VARCHAR(36) NOT NULL,\n" +
" member_id BIGINT NOT NULL,\n" +
" mobile_number VARCHAR(14) NOT NULL,\n" +
" valid_from TIMESTAMP NOT NULL,\n" +
" valid_to TIMESTAMP NOT NULL \n" +
") WITH (\n" +
" 'connector' = 'jdbc',\n" +
" 'url' = 'jdbc:postgresql://flinkpg:5432/flink-demo',\n" +
" 'table-name' = 'mobile_account',\n" +
" 'driver' = 'org.postgresql.Driver',\n" +
" 'username' = 'flink-demo',\n" +
" 'password' = 'flink-demo'\n" +
")");
Table mobileAccounts = tableEnv.from("mobile_accounts");
report(mobileAccounts).executeInsert("mobile_account");
}
public static Table report(Table mobileAccounts) {
return mobileAccounts.select(
$("mobile_acc_id"),
$("member_id"),
$("mobile_number"),
$("valid_from"),
$("valid_to"));
}
What I have noticed on the Flink console is that it is only consuming from one source!
I liked the TableEnvironment approach because not much code is needed to get the items inserted into the DB.
How can we consume from both sources, the Kinesis stream and the Kafka-backed table, in Flink?
Am I using the right approach?
Is there an alternative to implement my requirements?
Assuming you are able to create the tables correctly, you can simply JOIN the two streams, named kafka_stream and kinesis_stream here:
SELECT l.*, r.member_id FROM kinesis_stream AS l
INNER JOIN kafka_stream AS r
ON l.mobile_number = r.mobile_number;
If the PostgreSQL sink is essential, you can populate it with a separate query:
INSERT INTO postgre_sink
SELECT * FROM kafka_stream;
These two statements will solve your problem with the Table API (or Flink SQL).
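For example, you can attach both INSERTs to a single job with a StatementSet, so the Postgres load and the enrichment run together. A rough sketch (the enriched_sink table and its DDL are assumed here, not taken from your code):
// assumes: import org.apache.flink.table.api.*;
TableEnvironment tableEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());

// CREATE TABLE statements for kafka_stream, kinesis_stream,
// postgre_sink and enriched_sink go here via tableEnv.executeSql(...).

StatementSet statements = tableEnv.createStatementSet();
// 1) copy the Kafka events into Postgres
statements.addInsertSql("INSERT INTO postgre_sink SELECT * FROM kafka_stream");
// 2) enrich the Kinesis events with member_id and write them to the output sink
statements.addInsertSql(
        "INSERT INTO enriched_sink " +
        "SELECT l.*, r.member_id " +
        "FROM kinesis_stream AS l " +
        "INNER JOIN kafka_stream AS r ON l.mobile_number = r.mobile_number");
statements.execute(); // submits both INSERTs as one Flink job
This sketch joins the two streams directly; if you prefer to read member_id back from Postgres instead, the JDBC connector also supports lookup (temporal) joins.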
I am using the Flink Table API to pull data from a Kinesis stream into a table. I want to periodically pull that data into a temporary table and run a custom scalar function on it. However, I notice that my scalar function is not being called at all.
Here is the code for the Kinesis table:
this.tableEnv.executeSql("CREATE TABLE transactions (\n" +
" entry STRING,\n" +
" sequence_number VARCHAR(128) NOT NULL METADATA FROM 'sequence-number' VIRTUAL,\n" +
" shard_id VARCHAR(128) NOT NULL METADATA FROM 'shard-id' VIRTUAL,\n" +
" arrival_time TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL,\n" +
" WATERMARK FOR arrival_time AS arrival_time - INTERVAL '5' SECOND\n" +
") WITH (\n" +
" 'connector' = 'kinesis',\n" +
" 'stream' = '" + streamName + "',\n" +
" 'aws.region' = 'us-west-2', \n" +
" 'format' = 'raw'\n" +
")");
Then, I want to run a tumbling window every second that pulls data from Kinesis and updates a temporary table.
My temporary table is defined like this:
this.tableEnv.executeSql("CREATE TABLE temporaryTable (\n" +
" entry STRING,\n" +
" sequence_number VARCHAR(128) NOT NULL,\n" +
" shard_id VARCHAR(128) NOT NULL,\n" +
" arrival_time TIMESTAMP(3),\n" +
" record_list STRING NOT NULL,\n" +
" PRIMARY KEY (shard_id, sequence_number) NOT ENFORCED" +
") WITH (\n" +
" 'connector' = 'print'\n" +
")");
I then have code to do the tumbling:
Table inMemoryTable = transactions.
window(Tumble.over(lit(1).second()).on($("arrival_time")).as("log_ts"))
.groupBy($("entry"), $("sequence_number"), $("log_ts"), $("shard_id"), $("arrival_time"))
.select(
$("entry"),
$("sequence_number"), $("shard_id"), $("arrival_time"),
(call(CustomFunction.class, $("entry")).as("record_list")));
inMemoryTable.executeInsert("temporaryTable");
The CustomFunction class looks like this:
public class CustomFunction extends ScalarFunction {
    @DataTypeHint("STRING")
    public String eval(
            @DataTypeHint("STRING") String serializedEntry) throws IOException {
        return "asd";
    }
}
When I run this code in Flink, I don't get anything on stdout, so obviously I am missing something.
Thanks for any help.
I am able to get the stream to print with:
driver.tableEnv.getConfig().getConfiguration().setString("table.exec.source.idle-timeout", "10000 ms");
driver.env.getConfig().setAutoWatermarkInterval(5000);
I am using Flink 1.14 deployed with the Lyft Flink operator.
I am trying to build a tumble window aggregate with the Table API: read from the transactions table source and write the aggregated result per window into a new Kafka topic.
My source is a Kafka topic from Debezium.
EnvironmentSettings settings = EnvironmentSettings.inStreamingMode();
TableEnvironment tEnv = TableEnvironment.create(settings);
//this is the source
tEnv.executeSql("CREATE TABLE transactions (\n" +
" event_time TIMESTAMP(3) METADATA FROM 'value.source.timestamp' VIRTUAL,\n"+
" transaction_time AS TO_TIMESTAMP_LTZ(4001, 3),\n"+
" id INT PRIMARY KEY,\n" +
" transaction_status STRING,\n" +
" transaction_type STRING,\n" +
" merchant_id INT,\n" +
" WATERMARK FOR transaction_time AS transaction_time - INTERVAL '5' SECOND\n" +
") WITH (\n" +
" 'debezium-json.schema-include' = 'true' ,\n" +
" 'connector' = 'kafka',\n" +
" 'topic' = 'dbserver1.inventory.transactions',\n" +
" 'properties.bootstrap.servers' = 'my-cluster-kafka-bootstrap.kafka.svc:9092',\n" +
" 'properties.group.id' = 'testGroup',\n" +
" 'scan.startup.mode' = 'earliest-offset',\n"+
" 'format' = 'debezium-json'\n" +
")");
I do the tumble window and count the ids in each window with:
public static Table report(Table transactions) {
return transactions
.window(Tumble.over(lit(2).minutes()).on($("transaction_time")).as("w"))
.groupBy($("w"), $("transaction_status"))
.select(
$("w").start().as("window_start"),
$("w").end().as("window_end"),
$("transaction_status"),
$("id").count().as("id_count"));
}
The sink is:
tEnv.executeSql("CREATE TABLE my_report (\n" +
"window_start TIMESTAMP(3),\n"+
"window_end TIMESTAMP(3)\n,"+
"transaction_status STRING,\n" +
" id_count BIGINT,\n" +
" PRIMARY KEY (window_start) NOT ENFORCED\n"+
") WITH (\n" +
" 'connector' = 'upsert-kafka',\n" +
" 'topic' = 'dbserver1.inventory.my-window-sink',\n" +
" 'properties.bootstrap.servers' = 'my-cluster-kafka-bootstrap.kafka.svc:9092',\n" +
" 'properties.group.id' = 'testGroup',\n" +
" 'key.format' = 'json',\n"+
" 'value.format' = 'json'\n"+
")");
Table transactions = tEnv.from("transactions");
Table merchants = tEnv.from("merchants");
report(transactions).executeInsert("my_report");
The problem is that when I consume dbserver1.inventory.my-window-sink with
kubectl -n kafka exec my-cluster-kafka-0 -c kafka -i -t -- bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic dbserver1.inventory.my-window-sink --from-beginning
I don't get any results. I wait 2 minutes (the window size), insert into the transactions table, then wait again for 2 minutes and insert again; still no results.
I don't know if I have a problem with my watermark.
I am working with parallelism: 2.
On the Flink dashboard UI I can see, in the details of the GroupWindowAggregate task, that Records Received increases when I insert into the table, but I still can't see any results when I consume the topic.
With this line
transaction_time AS TO_TIMESTAMP_LTZ(4001, 3)
you have given every event the same transaction time (4001), and with
WATERMARK FOR transaction_time AS transaction_time - INTERVAL '5' SECOND
you have arranged for the watermarks to depend on the transaction_time. With this arrangement, time is standing still, and the windows can never close.
As for "I wait 2 minutes (the window size)," this isn't how event time processing works. Assuming the timestamps and watermarks were actually moving forward, you would need to wait however long it takes to process 2 minutes worth of data.
In addition to what David thankfully answered, I was missing table.exec.source.idle-timeout in the configuration of the streaming environment, an option that marks a source as temporarily idle when it hasn't produced any records for the given time, so watermarks can still advance.
The default value is 0, which means idleness detection is disabled. I set it to 1000 ms and that fixed it: the idle source condition is detected and the watermarks are generated properly that way.
This probably won't affect regular streams with consistent message ingestion, but it was the case for me because I was inserting records manually, so the stream was idle a lot of the time.
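For reference, the option can be set directly on the table environment's configuration, e.g. (sketch; tEnv is the TableEnvironment from the question):
// mark a source as temporarily idle after 1 second without records,
// so downstream watermarks can still advance
tEnv.getConfig().getConfiguration()
        .setString("table.exec.source.idle-timeout", "1000 ms");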
I am using Flink 1.11 and trying a nested query with MATCH_RECOGNIZE inside, as shown below:
SELECT * FROM events WHERE id = (
  SELECT * FROM events MATCH_RECOGNIZE (
    PARTITION BY org_id
    ORDER BY proctime
    MEASURES A.id AS startId
    ONE ROW PER MATCH
    PATTERN (A C* B)
    DEFINE A AS A.tag = 'tag1', C AS C.tag <> 'tag2', B AS B.tag = 'tag2'
  )
);
And I am getting this error: org.apache.calcite.sql.validate.SqlValidatorException: Table 'A' not found
Is this not supported? If not, what's the alternative?
I was able to get something working by doing this:
Table events = tableEnv.fromDataStream(input,
$("sensorId"),
$("ts").rowtime(),
$("kwh"));
tableEnv.createTemporaryView("events", events);
Table matches = tableEnv.sqlQuery(
"SELECT id " +
"FROM events " +
"MATCH_RECOGNIZE ( " +
"PARTITION BY sensorId " +
"ORDER BY ts " +
"MEASURES " +
"this_step.sensorId AS id " +
"AFTER MATCH SKIP TO NEXT ROW " +
"PATTERN (this_step next_step) " +
"DEFINE " +
"this_step AS TRUE, " +
"next_step AS TRUE " +
")"
);
tableEnv.createTemporaryView("mmm", matches);
Table results = tableEnv.sqlQuery(
"SELECT * FROM events WHERE events.sensorId IN (select * from mmm)");
tableEnv
.toAppendStream(results, Row.class)
.print();
For some reason, I couldn't get it to work without defining a view. I kept getting Calcite errors.
I guess you are trying to avoid enumerating all of the columns from A in the MEASURES clause of the MATCH_RECOGNIZE. You may want to compare the resulting execution plans to see if there's any significant difference.
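To do the plan comparison, something like this prints the optimized plan for the view-based version (a sketch; tableEnv is the environment from the snippet above):
System.out.println(tableEnv.explainSql(
    "SELECT * FROM events WHERE events.sensorId IN (SELECT * FROM mmm)"));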
In Flink 1.11, I'm trying the debezium-json format, and the following should work, right? I'm trying to follow the docs [1].
TableResult products = bsTableEnv.executeSql(
"CREATE TABLE products (\n" +
" id BIGINT,\n" +
" name STRING,\n" +
" description STRING,\n" +
" weight DECIMAL(10, 2)\n" +
") WITH (\n" +
" 'connector' = 'kafka',\n" +
" 'topic' = 'dbserver1.inventory.products',\n" +
" 'properties.bootstrap.servers' = 'localhost:9092',\n" +
" 'properties.group.id' = 'testGroup',\n" +
"'scan.startup.mode'='earliest-offset',\n" +
" 'format' = 'debezium-json'" +
")"
);
bsTableEnv.executeSql("SHOW TABLES").print(); // This seems to work;
bsTableEnv.executeSql("SELECT id FROM products").print();
Output Snippet / Exception:
+------------+
| table name |
+------------+
| products |
+------------+
1 row in set
Exception in thread "main" org.apache.flink.table.api.TableException: AppendStreamTableSink doesn't support consuming update and delete changes which is produced by node TableSourceScan(table=[[default_catalog, default_database, products]], fields=[id, name, description, weight])
I have verified Debezium setup and there are messages in the dbserver1.inventory.products topic. I'm able to read from Kafka topics in Flink using other approaches, but as previously described, I'm hoping to get the debezium-json format to work.
Also, I understand Flink 1.12 introduces new Kafka Upsert connector, but I'm stuck using 1.11 for now.
I'm pretty new to Flink, so entirely possible I'm missing something obvious here.
Thanks in advance
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/formats/debezium.html
Asked too soon, it seems. In case it possibly helps someone else, I was able to get it to work with
Table results = bsTableEnv.sqlQuery("SELECT id, name FROM products");
bsTableEnv.toRetractStream(results, Row.class).print();
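For completeness, a minimal sketch of the surrounding boilerplate (assuming bsTableEnv is a StreamTableEnvironment created from a StreamExecutionEnvironment named bsEnv; the names are illustrative):
Table results = bsTableEnv.sqlQuery("SELECT id, name FROM products");
// the retract stream is a DataStream<Tuple2<Boolean, Row>>: true = add, false = retract
bsTableEnv.toRetractStream(results, Row.class).print();
bsEnv.execute("debezium-products"); // the print sink only runs once the job is executed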
I inherited a project and I'm running into a SQL error that I'm not sure how to fix.
On an eCommerce site, the code is inserting order shipping info into another database table.
Here's the code that is inserting the info into the table:
string sql = "INSERT INTO AC_Shipping_Addresses
(pk_OrderID, FullName, Company, Address1, Address2, City, Province, PostalCode, CountryCode, Phone, Email, ShipMethod, Charge_Freight, Charge_Subtotal)
VALUES (" + _Order.OrderNumber;
sql += ", '" + _Order.Shipments[0].ShipToFullName.Replace("'", "''") + "'";
if (_Order.Shipments[0].ShipToCompany == "")
{
sql += ", '" + _Order.Shipments[0].ShipToFullName.Replace("'", "''") + "'";
}
else
{
sql += ", '" + _Order.Shipments[0].ShipToCompany.Replace("'", "''") + "'";
}
sql += ", '" + _Order.Shipments[0].Address.Address1.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.Address2.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.City.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.Province.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.PostalCode.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.Country.Name.Replace("'", "''") + "'";
sql += ", '" + _Order.Shipments[0].Address.Phone.Replace("'", "''") + "'";
if (_Order.Shipments[0].ShipToEmail == "")
{
sql += ",'" + _Order.BillToEmail.Replace("'", "''") + "'";
}
else
{
sql += ",'" + _Order.Shipments[0].ShipToEmail.Replace("'", "''") + "'";
}
sql += ", '" + _Order.Shipments[0].ShipMethod.Name.Replace("'", "''") + "'";
sql += ", " + shippingAmount;
sql += ", " + _Order.ProductSubtotal.ToString() + ")";
bll.dbUpdate(sql);
It is working correctly, but it is also outputting the following SQL error:
Violation of PRIMARY KEY constraint 'PK_AC_Shipping_Addresses'. Cannot insert
duplicate key in object 'dbo.AC_Shipping_Addresses'. The duplicate key value
is (165863).
From reading similar questions, it seems that I should declare the ID in the statement.
Is that correct? How would I adjust the code to fix this issue?
I was getting the same error on a restored database when I tried to insert a new record using Entity Framework. It turned out that the Identity/Seed was screwing things up.
Using a reseed command fixed it.
DBCC CHECKIDENT ('[Prices]', RESEED, 4747030);
GO
I'm pretty sure pk_OrderID is the PK of AC_Shipping_Addresses,
and you are trying to insert a duplicate via _Order.OrderNumber.
Do a
select * from AC_Shipping_Addresses where pk_OrderID = 165863;
or select count(*) ....
Pretty sure you will get a row returned.
It is telling you that you are already using pk_OrderID = 165863 and cannot have another row with that value.
If you want to skip the insert when a row already exists:
insert into table (pk, value)
select 11 as pk, 'val' as value
where not exists (select 1 from table where pk = 11)
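In Java/JDBC terms the same guarded insert looks roughly like this (a sketch only; the connection string, the trimmed column list and the local variables are illustrative, not taken from the original code):
// assumes: import java.sql.*;
String sql =
    "INSERT INTO AC_Shipping_Addresses (pk_OrderID, FullName, City) " +
    "SELECT ?, ?, ? " +
    "WHERE NOT EXISTS (SELECT 1 FROM AC_Shipping_Addresses WHERE pk_OrderID = ?)";
try (Connection conn = DriverManager.getConnection(connectionString);
     PreparedStatement ps = conn.prepareStatement(sql)) {
    ps.setInt(1, orderNumber);
    ps.setString(2, fullName);
    ps.setString(3, city);
    ps.setInt(4, orderNumber);
    int rowsInserted = ps.executeUpdate(); // 0 means the order id already existed
} catch (SQLException e) {
    e.printStackTrace();
}
Parameterising the values also avoids the quote-escaping (and injection risk) in the original concatenation.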
What is the value you're passing to the primary key (presumably "pk_OrderID")? You can set it up to auto increment, and then there should never be a problem with duplicating the value - the DB will take care of that. If you need to specify a value yourself, you'll need to write code to determine what the max value for that field is, and then increment that.
If you have a column named "ID" or such that is not shown in the query, that's fine as long as it is set up to autoincrement - but it's probably not, or you shouldn't get that error message. Also, you would be better off writing an easier-on-the-eye query and using params. As the lad of nine years hence inferred, you're leaving your database open to SQL injection attacks if you simply plop in user-entered values. For example, you could have a method like this:
internal static int GetItemIDForUnitAndItemCode(string qry, string unit, string itemCode)
{
int itemId;
using (SqlConnection sqlConn = new SqlConnection(ReportRunnerConstsAndUtils.CPSConnStr))
{
using (SqlCommand cmd = new SqlCommand(qry, sqlConn))
{
cmd.CommandType = CommandType.Text;
cmd.Parameters.Add("#Unit", SqlDbType.VarChar, 25).Value = unit;
cmd.Parameters.Add("#ItemCode", SqlDbType.VarChar, 25).Value = itemCode;
sqlConn.Open();
itemId = Convert.ToInt32(cmd.ExecuteScalar());
}
}
return itemId;
}
...that is called like so:
int itemId = SQLDBHelper.GetItemIDForUnitAndItemCode(GetItemIDForUnitAndItemCodeQuery, _unit, itemCode);
You don't have to, but I store the query separately:
public static readonly String GetItemIDForUnitAndItemCodeQuery = "SELECT PoisonToe FROM Platypi WHERE Unit = @Unit AND ItemCode = @ItemCode";
You can verify that you're not about to insert an already-existing value by (pseudocode):
bool alreadyExists = IDAlreadyExists(query, value) > 0;
The query is something like "SELECT COUNT(*) FROM TABLE WHERE BLA = @CANDIDATEIDVAL" and the value is the ID you're potentially about to insert:
if (alreadyExists) // keep inc'ing and checking until false, then use that id value
Justin wants to know if this will work:
string exists = "SELECT 1 from AC_Shipping_Addresses where pk_OrderID = " _Order.OrderNumber; if (exists > 0)...
What seems like it would work to me is:
string existsQuery = string.format("SELECT 1 from AC_Shipping_Addresses where pk_OrderID = {0}", _Order.OrderNumber);
// Or, better yet:
string existsQuery = "SELECT COUNT(*) from AC_Shipping_Addresses where pk_OrderID = #OrderNumber";
// Now run that query after applying a value to the OrderNumber query param (use code similar to that above); then, if the result is > 0, there is such a record.
To prevent inserting a record that already exists, I'd check if the ID value exists in the database. Take the example of a table created with an IDENTITY primary key:
CREATE TABLE [dbo].[Persons] (
ID INT IDENTITY(1,1) PRIMARY KEY,
LastName VARCHAR(40) NOT NULL,
FirstName VARCHAR(40)
);
Now insert JANE DOE and JOE BROWN into the database:
SET IDENTITY_INSERT [dbo].[Persons] OFF;
INSERT INTO [dbo].[Persons] (FirstName,LastName)
VALUES ('JANE','DOE');
INSERT INTO Persons (FirstName,LastName)
VALUES ('JOE','BROWN');
DATABASE OUTPUT of TABLE [dbo].[Persons] will be:
ID  LastName  FirstName
1   DOE       JANE
2   BROWN     JOE
I'd check whether I should update an existing record or insert a new one, as in the following Java example:
int NewID = 1;
boolean IdAlreadyExist = false;
// Using SQL database connection
// STEP 1: Set property
System.setProperty("java.net.preferIPv4Stack", "true");
// STEP 2: Register JDBC driver
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
// STEP 3: Open a connection (closed automatically by try-with-resources)
try (Connection conn1 = DriverManager.getConnection(DB_URL, USER, pwd)) {
    conn1.setAutoCommit(true);
    String select = "select * from Persons where ID = " + NewID;
    Statement st1 = conn1.createStatement();
    ResultSet rs1 = st1.executeQuery(select);
    // iterate through the Java result set
    while (rs1.next()) {
        int id = rs1.getInt("ID");
        if (NewID == id) {
            IdAlreadyExist = true;
        }
    }
} catch (SQLException e1) {
    System.out.println(e1);
}
if (IdAlreadyExist == false) {
    // Insert new record code here
} else {
    // Update existing record code here
}
Not an answer to the OP's exact problem, but since this was the first question that popped up for me in Google, I'd also like to add that users searching for this might need to reseed their table, which was the case for me:
DBCC CHECKIDENT(tablename)
There could be several things causing this and it somewhat depends on what you have set up in your database.
First, you could be using a PK in the table that is also an FK to another table, making the relationship 1-1. In this case you may need to do an update rather than an insert. If you really can have only one address record per order, this may be what is happening.
Next, you could be using some sort of manual process to determine the id ahead of time. The trouble with those manual processes is that they can create race conditions where two records grab the same last id and increment it by one, and then the second one can't insert.
Third, your query as it is sent to the database may be creating two records. To determine if this is the case, run Profiler to see exactly what SQL code you are sending; if it is a SELECT instead of a VALUES clause, run the SELECT and see whether the joins have caused some records to be duplicated. In any event, when you are creating code on the fly like this, the first troubleshooting step is ALWAYS to run Profiler and see if what got sent was what you expected to be sent.
Make sure your table doesn't already have rows whose primary key values are the same as the primary key ID in your query.