I'm trying to implement an event-time temporal join, but I don't see any data being emitted from the join. I don't see any runtime exceptions either.
Flink Version: 1.13
Kafka topics have only 1 partition for now
Here's how I set it up:
I have an "append-only" DataStream (left input/probe side) which looks like the following:
{
"eventType": String,
"eventTime": LocalDateTime,
"eventId": String
}
So, I convert this datastream to a table before joining them:
var eventTable = tableEnv.fromDataStream(eventStream, Schema.newBuilder()
.column("eventId", DataTypes.STRING())
.column("eventTime", DataTypes.TIMESTAMP(3))
.column("eventType", DataTypes.STRING())
.watermark("eventTime", $("eventTime"))
.build());
Then, I have the "versioned table" (right input/build side) backed by Kafka (Debezium CDC changelog) which looks like the following:
CREATE TABLE metadata (
id VARCHAR,
eventMetadata VARCHAR,
origin_ts TIMESTAMP(3) METADATA FROM 'value.source.timestamp' VIRTUAL,
PRIMARY KEY (id) NOT ENFORCED,
WATERMARK FOR origin_ts AS origin_ts
) WITH (
'connector' = 'kafka',
'properties.bootstrap.servers' = 'SERVER_ADDR',
'properties.group.id' = 'SOME_GROUP',
'topic' = 'SOME_TOPIC',
'scan.startup.mode' = 'latest-offset',
'value.format' = 'debezium-json'
)
The join query looks like this:
SELECT e.eventId, e.eventTime, e.eventType, m.eventMetadata
FROM events_view AS e
JOIN metadata_view FOR SYSTEM_TIME AS OF e.eventTime AS m
ON e.eventId = m.id
Following some other post on here, I've set the source idle-timeout:
table.exec.source.idle-timeout -> 5
And I've also tried setting idleness on the watermark strategy to make sure the source doesn't hold back emitting watermarks. At this point I can see watermarks being generated, but I still don't get any results. Everything just ends up sitting in the temporal join operator.
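For reference, the idle-timeout option expects a duration (a bare number is typically read as milliseconds); a minimal sketch of setting it, e.g. from the SQL client, with '5 s' as an assumed value:

-- mark sources idle after 5 seconds so an idle partition doesn't hold back
-- the watermark that the temporal join is waiting for
SET 'table.exec.source.idle-timeout' = '5 s';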
So, the solution here was to use a processing-time temporal join instead, and the tricky part was its syntax. Here's how to fix this:
// register the metadata table as a temporal table function by specifying its time attribute and primary key
var metadataHistory = tableEnv.from("metadata")
.createTemporalTableFunction($("proc_time"), $("id"));
tableEnv.createTemporarySystemFunction("metadata_view", metadataHistory);
// sql processing time temporal join
var temporalJoinResult = tableEnv.sqlQuery("SELECT" +
" e.eventId, e.eventType, e.eventTime, m.eventMetadata" +
" FROM events_view AS e," +
" LATERAL TABLE (metadata_view(t.procTime)) AS m" +
" WHERE e.eventId = m.id");
Here, proc_time on the metadata table needs to be declared within the table DDL like this:
CREATE TABLE metadata (
id VARCHAR,
eventMetadata VARCHAR,
proc_time as PROCTIME(),
PRIMARY KEY (id) NOT ENFORCED
) WITH (
'connector' = 'kafka',
'properties.bootstrap.servers' = 'SERVER_ADDR',
'properties.group.id' = 'SOME_GROUP',
'topic' = 'SOME_TOPIC',
'scan.startup.mode' = 'latest-offset',
'value.format' = 'debezium-json'
)
and while converting the DataStream to a table, declare the procTime column for that table as well, like this:
var eventTable = tableEnv.fromDataStream(eventStream, Schema.newBuilder()
.column("eventId", DataTypes.STRING())
.column("eventTime", DataTypes.TIMESTAMP(3))
.column("eventType", DataTypes.STRING())
.columnByExpression("procTime", "PROCTIME()")
.build());
Related
I have SQL defined this way:
CREATE TABLE raw_table
(
headers VARCHAR,
id VARCHAR,
type VARCHAR,
contentJson VARCHAR
) WITH (
'connector' = 'kafka',
'topic-pattern' = 'role__.+?',
'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
'properties.group.id' = 'role_local_1',
'scan.startup.mode' = 'earliest-offset',
'format' = 'json',
'properties.allow.auto.create.topics' = 'true',
'json.timestamp-format.standard' = 'ISO-8601',
'sink.parallelism' = '3'
);
create view ROLES_NORMALIZED as
(
select
JSON_VALUE(contentJson, '$.id') as id,
rr.type as type
from raw_table rr
);
CREATE VIEW ROLES_UPSERTS_V1 AS
(
SELECT *
FROM ROLES_NORMALIZED
WHERE type in ('ROLE_CREATED', 'ROLE_UPDATED')
);
CREATE VIEW ROLES_DELETED_V1 AS
(
SELECT org,
pod,
tenantId,
id,
modified,
modified as deleted,
event_timestamp
FROM ROLES_NORMALIZED
WHERE type in ('ROLES_DELETED')
);
-------
CREATE TABLE final_topic
(
event_timestamp TIMESTAMP_LTZ,
id VARCHAR,
name VARCHAR,
deleted TIMESTAMP_LTZ,
PRIMARY KEY (pod, org, id) NOT ENFORCED
) WITH (
'connector' = 'upsert-kafka',
'topic' = 'final_topic',
'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
'properties.group.id' = 'some-group-id',
'value.format' = 'json',
'key.format' = 'json',
'properties.allow.auto.create.topics' = 'true',
'properties.replication.factor' = '3',
'value.json.timestamp-format.standard' = 'ISO-8601',
'sink.parallelism' = '3'
);
INSERT INTO final_topic
select
GREATEST(r.event_timestamp, d.event_timestamp) as event_timestamp,
r.id,
r.name,
d.deleted
from ROLES_UPSERTS_V1 r
LEFT JOIN ROLES_DELETED_V1 d
ON r.id = d.id;
The final_topic produces the result I want to see, which is the join of ROLES_UPSERTS_V1 and ROLES_DELETED_V1.
I tried this by publishing records to topics matching the role__.+? pattern.
What I am observing is that final_topic has null values as well. These are emitted whenever the changelog row kind happens to be -D (DELETE). I understand why this exists (it's saying the original message needs to be deleted and a new one will follow), but I don't want such null values in my final_topic, just the desired final state. Is this possible?
The alternative I am trying is to use the regular Kafka connector. But the join does not seem to work, as I get an error saying org.apache.flink.table.api.TableException: Table sink 'default_catalog.default_database.final_topic' doesn't support consuming update and delete changes which is produced by node Join(joinType=[LeftOuterJoin]. I get the error when I use views for ROLES_UPSERTS_V1 and ROLES_DELETED_V1. But if I have these as tables (with the kafka connector), only the inner join works (the left join does not work).
If you don't want null values, you can consider buffering records before you flush the results to the Upsert Kafka connector. See https://nightlies.apache.org/flink/flink-docs-stable/docs/connectors/table/upsert-kafka/#sink-buffer-flush-max-rows for more details.
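For illustration, a minimal sketch of what that could look like on the upsert-kafka sink definition; the table name, columns, and flush thresholds here are assumptions, not your exact schema:

CREATE TABLE final_topic_buffered (
    id              VARCHAR,
    name            VARCHAR,
    event_timestamp TIMESTAMP_LTZ,
    deleted         TIMESTAMP_LTZ,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic' = 'final_topic',
    'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
    'key.format' = 'json',
    'value.format' = 'json',
    -- both buffer options must be set (> 0) to enable buffering; within the
    -- buffer only the last row per key is kept, so a -D quickly followed by
    -- a +I for the same key can be compacted away before reaching Kafka
    'sink.buffer-flush.max-rows' = '1000',
    'sink.buffer-flush.interval' = '1 s'
);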
Regarding your join, as outlined in the docs: "For streaming queries, the grammar of regular joins is the most flexible and allow for any kind of updating (insert, update, delete) input table." Since you're running a streaming query, a future change could mean that the result of your join is an update or a delete. However, the sink that you're trying to emit to does not support this, hence the error.
I am using Flink 1.12.0, and I am reading https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/joins.html#processing-time-temporal-join. It looks like the processing-time temporal join is supported.
But when I run the following application, it complains Processing-time temporal join is not supported yet. I am confused about whether it is a code error or whether Flink really doesn't support the processing-time temporal join.
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
val tenv = StreamTableEnvironment.create(env)
val ddl1 =
"""
create table s1(
id STRING,
tradeDate TIMESTAMP,
price DOUBLE,
pt as PROCTIME()
) with (
'connector' = 'filesystem',
'path' = 'D:/T018_LookupJoin_stock.csv',
'format' = 'csv'
)
""".stripMargin(' ')
tenv.executeSql(ddl1)
val ddl2 =
"""
create table s2(
id STRING primary key not enforced,
name STRING,
tradeDate TIMESTAMP
) with (
'connector' = 'filesystem',
'path' = 'D:/T018_LookupJoin_stocktimechanging.csv',
'format' = 'csv'
)
""".stripMargin(' ')
tenv.executeSql(ddl2)
val sql =
"""
select s1.id ,s1.price, s1.tradeDate, u.name, u.tradeDate
from s1 join s2 for SYSTEM_TIME as of s1.pt as u
on s1.id = u.id
""".stripMargin(' ')
tenv.sqlQuery(sql).toAppendStream[Row].print()
env.execute()
I'm trying to update a temporary table called #deletedRecords with the data from a table called Log.
The KeyValue in the Log table is the same as the ID in #deletedRecords.
There is a column in #deletedRecords for every FieldName for any particular key value.
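Roughly, the two tables look like this (a purely hypothetical sketch: the column names come from the query below, and the data types are guesses):

-- hypothetical shapes only; the real columns/types may differ
CREATE TABLE #deletedRecords (
    ID             INT,
    PatientName    VARCHAR(200),
    ChargeNotes    VARCHAR(500),
    Units          VARCHAR(50),
    ChargeStatusID INT
);

-- Log is an audit table holding one row per changed field (KeyValue + FieldName)
CREATE TABLE Log (
    TableName VARCHAR(100),
    EventType VARCHAR(20),
    KeyValue  INT,            -- matches #deletedRecords.ID
    FieldName VARCHAR(100),
    OldValue  VARCHAR(500)
);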
I tried to extract the values using the following query:
UPDATE #deletedRecords
SET PatientName = (SELECT ACL.OldValue WHERE ACL.FieldName = 'CptCode'),
ChargeNotes = (SELECT ACL.OldValue WHERE ACL.FieldName = 'ChargeNotes'),
Units = (SELECT ACL.OldValue WHERE ACL.FieldName = 'Units'),
ChargeStatusID = (SELECT ACL.OldValue WHERE ACL.FieldName = 'Units')
FROM Log ACL
JOIN #deletedRecords DR ON ACL.KeyValue = DR.ID
WHERE ACL.TableName = 'BillingCharge'
AND ACL.EventType = 'DELETE'
However, when I run the query, all of the columns to be updated in #deletedRecords are null. Can somebody please help explain what I'm missing?
Thanks in advance.
EDIT:
In response to Yogesh Sharma's answer, I elected to use the CTE method. I would like to use the values from the CTE to join to additional tables and extract their values during the update.
E.g., the Log table doesn't contain an old value for the StatusName, but it does contain the ChargeStatusID, which could be used to join to another table that contains that information, such as a ChargeStatus table.
Thus I modified Yogesh Sharma's code to the following:
WITH cte AS
...
UPDATE d
SET d.PatientName = c.PatientName
, d.StatusName = cs.StatusName
FROM #deletedBillingChargeTemp d
JOIN cte c ON c.KeyValue = d.chargeID
JOIN ChargeStatus cs ON c.ChargeStatusID = cs.ChargeStatusID
However, once I add that secondary join, all of the updated values return to null, as they were before Yogesh Sharma's suggestions were implemented.
Your query does not work because the UPDATE is executed multiple times for each row in DR, considering only the conditions specified in the last three lines of your query (not the ones specified in the subqueries). The values that remain in the table are the ones that correspond to the ACL row used in the last execution (and the order of execution cannot be controlled). If, for the ACL row used in the last execution, the subqueries return NULL, you will get a NULL result.
See the example in the https://learn.microsoft.com/en-us/sql/t-sql/queries/update-transact-sql topic, where it says "The results of an UPDATE statement are undefined if the statement includes a FROM clause that is not specified in such a way that only one value is available for each column occurrence that is updated, that is if the UPDATE statement is not deterministic.".
You should rewrite your query like this:
UPDATE #deletedRecords
SET PatientName = (
SELECT ACL.OldValue FROM Log ACL
WHERE ACL.FieldName = 'CptCode' AND ACL.KeyValue = DR.ID
AND ACL.TableName = 'BillingCharge' AND ACL.EventType = 'DELETE'
),
ChargeNotes = (
SELECT ACL.OldValue FROM Log ACL
WHERE ACL.FieldName = 'ChargeNotes' AND ACL.KeyValue = DR.ID
AND ACL.TableName = 'BillingCharge' AND ACL.EventType = 'DELETE'
),
Units = (
SELECT ACL.OldValue FROM Log ACL
WHERE ACL.FieldName = 'Units' AND ACL.KeyValue = DR.ID
AND ACL.TableName = 'BillingCharge' AND ACL.EventType = 'DELETE'
),
ChargeStatusID = (
SELECT ACL.OldValue FROM Log ACL
WHERE ACL.FieldName = 'Units' AND ACL.KeyValue = DR.ID
AND ACL.TableName = 'BillingCharge' AND ACL.EventType = 'DELETE'
)
FROM #deletedRecords DR
You would be required to do some conditional aggregation on the Log table and do the JOINs in order to update the records in the temporary table #deletedRecords.
So, the conditional approach could be achieved via a CTE or a subquery:
WITH cte AS
(
SELECT KeyValue,
MAX(CASE WHEN FieldName = 'CptCode' THEN OldValue END) PatientName,
MAX(CASE WHEN FieldName = 'ChargeNotes' THEN OldValue END) ChargeNotes,
...
FROM Log
WHERE TableName = 'BillingCharge' AND EventType = 'DELETE'
GROUP BY KeyValue
)
UPDATE d
SET d.PatientName = c.PatientName,
...
FROM #deletedRecords d
INNER JOIN cte c ON c.KeyValue = d.ID
The other way is to update your temporary table using a correlated subquery approach:
UPDATE d
SET d.PatientName = (SELECT TOP 1 OldValue FROM Log WHERE KeyValue = d.ID AND
TableName = 'BillingCharge' AND EventType = 'DELETE' AND FieldName = 'CptCode'),
d.ChargeNotes= (SELECT TOP 1 OldValue FROM Log WHERE KeyValue = d.ID AND
TableName = 'BillingCharge' AND EventType = 'DELETE' AND FieldName = 'ChargeNotes'),
...
FROM #deletedRecords d
If your updated columns are NULL, these are its possible causes:
Since you are doing an INNER JOIN, records might not be joining correctly by their joining column. Make sure both tables have the same values in the joining columns.
Since you are filtering in a WHERE clause, records might not fulfill your TableName and EventType filters. Make sure there are records that successfully INNER JOIN between them and have the supplied TableName and EventType (see the diagnostic query after this list).
The values you are assigning are NULL. Make sure your subqueries return a non-null value.
Table reference is off. When updating a table in SQL Server, always use the updating table alias if you are using one.
Use
UPDATE DR SET
YourColumn = Value
FROM
Log ACL
JOIN #deletedRecords DR ON ...
Instead of
UPDATE #deletedRecords SET
YourColumn = Value
FROM
Log ACL
JOIN #deletedRecords DR ON ...
Make sure you are NOT checking the table variable's values in another batch, script, or procedure. Table variables' scope is limited to the current batch or procedure, while temporary tables remain as long as the session is alive.
Make sure that there isn't another statement setting those values to NULL after your update. Also keep an eye on your transactions (they might not be committed, or might be rolled back).
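As a quick way to check the first two causes above, run the same joins and filters as a plain SELECT first; a minimal sketch based on the query from the question:

-- if this returns no rows, the UPDATE has nothing to apply: either the join
-- keys don't line up or the TableName/EventType filters exclude everything
SELECT DR.ID, ACL.FieldName, ACL.OldValue
FROM Log ACL
JOIN #deletedRecords DR ON ACL.KeyValue = DR.ID
WHERE ACL.TableName = 'BillingCharge'
  AND ACL.EventType = 'DELETE';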
Not the most complex problem, but I've been looking at this one for a while and I'm at a block. Here is a T-SQL SELECT statement I want to turn into an UPDATE statement. The table structures are included for reference (server, db, and table names are simplified from what they actually are).
SELECT m.ConfirmationCode as CodeInMain,
e.ConfirmationCode as CodeInEvents
FROM Server1.DB1.dbo.MainTable AS m
INNER JOIN Server2.DB1.dbo.Events AS e ON m.AccountCode = e.AccountCode
AND m.Commodity = (
SELECT Alias
FROM Server2.DB1.dbo.Commodities
WHERE pkCommodityId = e.fkCommodityId )
WHERE m.AccountCode = e.AccountCode
AND m.Brand IN( 'FTR', 'CER' );
Here are the referenced table schemas:
Server1.DB1.dbo.MainTable (ConfirmationCode varchar(100), AccountCode varchar(100), Commodity varchar(1), Brand varchar(3))
Server2.DB1.dbo.Events (ConfirmationCode varchar(100), AccountCode varchar(100), Commodity int)
Server2.DB1.dbo.Commodities (pkCommodityId int, Alias varchar(100))
Server 2 is linked to Server 1.
The current output of the select statement is:
CodeInMain | CodeInEvents
--------------------------
AN235cAc0a | NULL
CSORSX239c | NULL
...
All of my outputted information is as expected.
My goal is to update e.ConfirmationCode as CodeInEvents with the data in m.ConfirmationCode as CodeInMain, but I am getting stuck on how to accommodate the join. I realize that a cursor can be used to cycle through the contents of the above output when stored in a temporary table, but I would really like to be able to do this in a single UPDATE statement.
SQL Server has a non-ANSI extended UPDATE syntax that supports JOIN:
UPDATE e
SET e.ConfirmationCode = m.ConfirmationCode
FROM Server1.DB1.dbo.MainTable AS m
INNER JOIN Server2.DB1.dbo.Events AS e ON m.AccountCode = e.AccountCode
AND m.Commodity = (
SELECT Alias
FROM Server2.DB1.dbo.Commodities
WHERE pkCommodityId = e.fkCommodityId )
WHERE m.AccountCode = e.AccountCode
AND m.Brand IN( 'FTR', 'CER' );
got it.
Update e
set e.ConfirmationCode = m.ConfirmationCode
from Server2.DB1.dbo.Events e
inner join Server1.DB1.dbo.MainTable m on m.AccountCode = e.AccountCode and m.Commodity = (select Alias from Server2.DB1.dbo.Commodities where pkCommodityId = e.fkCommodityId)
where m.AccountCode = e.AccountCode and m.Brand in ('FTR', 'CER')
When programming a large transaction (lots of inserts, deletes, updates) and thereby violating a constraint in Informix (v10, but should apply to other versions too) I get a not very helpful message saying, for example, I violated constraint r190_710. How can I find out which table(s) and key(s) are covered by a certain constraint I know only the name of?
Tony Andrews suggested (pointing to a different end-point for the URL):
From Informix Guide to SQL: Reference it appears you should look at the system catalog tables SYSCONSTRAINTS and SYSINDICES.
The Informix system catalog is described in that manual.
The SysConstraints table is the starting point for analyzing a constraint, most certainly; you find the constraint name in that table, and from there you can find out the other details.
However, you also have to look at other tables, and not just (or even directly) SysIndices.
For example, I have a lot of NOT NULL constraints on the tables in my database. For those, the constraint type is 'N' and there is no need to look elsewhere for more information.
A constraint type of 'P' indicates a primary key; that would need more analysis via the SysIndexes view or SysIndices table. Similarly, a constraint type of 'U' indicates a unique constraint and needs extra information from the SysIndexes view or SysIndices table.
A constraint type of 'C' indicates a check constraint; the text (and binary compiled form) of the constraint is found in the SysChecks table (with types 'T' and 'B' for the data; the data is more or less encoded with Base-64, though without the '=' padding at the end and using different characters for 62 and 63).
Finally, a constraint type of 'R' indicates a referential integrity constraint. You use the SysReferences table to find out which table is referenced, and you use SysIndexes or SysIndices to establish which indexes on the referencing and referenced tables are used, and from that you can discover the relevant columns. This can get quite hairy!
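As a rough illustration of that referential case, here's a minimal sketch (assuming the constraint name from the question) that resolves an 'R' constraint to its referencing table, referenced table, and delete rule; the column details still come from the index parts as described:

-- sketch: map a referential constraint to its tables via the system catalog
SELECT t.tabname  AS referencing_table,
       rt.tabname AS referenced_table,
       r.delrule
FROM sysconstraints c
JOIN systables t     ON t.tabid    = c.tabid
JOIN sysreferences r ON r.constrid = c.constrid
JOIN systables rt    ON rt.tabid   = r.ptabid
WHERE c.constrname = 'r190_710'
  AND c.constrtype = 'R';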
Columns in a table with a constraint on them
SELECT
a.tabname, b.constrname, d.colname
FROM
systables a, sysconstraints b, sysindexes c, syscolumns d
WHERE
a.tabname = 'your_table_name_here'
AND
b.tabid = a.tabid
AND
c.idxname = b.idxname
AND
d.tabid = a.tabid
AND
(
d.colno = c.part1 or
d.colno = c.part2 or
d.colno = c.part3 or
d.colno = c.part4 or
d.colno = c.part5 or
d.colno = c.part6 or
d.colno = c.part7 or
d.colno = c.part8 or
d.colno = c.part9 or
d.colno = c.part10 or
d.colno = c.part11 or
d.colno = c.part12 or
d.colno = c.part13 or
d.colno = c.part14 or
d.colno = c.part15 or
d.colno = c.part16
)
ORDER BY
a.tabname,
b.constrname,
d.colname
I have been using the following query to get more information about the different types of constraints.
It's based on some spelunking in the system tables and several explanations about the system catalog.
sysconstraints.constrtype indicates the type of the constraint:
P = Primary key
U = Unique key / Alternate key
N = Not null
C = Check
R = Reference / Foreign key
select
tab.tabname,
constr.*,
chk.*,
c1.colname col1,
c2.colname col2,
c3.colname col3,
c4.colname col4,
c5.colname col5
from sysconstraints constr
join systables tab on tab.tabid = constr.tabid
left outer join syschecks chk on chk.constrid = constr.constrid and chk.type = 'T'
left outer join sysindexes i on i.idxname = constr.idxname
left outer join syscolumns c1 on c1.tabid = tab.tabid and c1.colno = abs(i.part1)
left outer join syscolumns c2 on c2.tabid = tab.tabid and c2.colno = abs(i.part2)
left outer join syscolumns c3 on c3.tabid = tab.tabid and c3.colno = abs(i.part3)
left outer join syscolumns c4 on c4.tabid = tab.tabid and c4.colno = abs(i.part4)
left outer join syscolumns c5 on c5.tabid = tab.tabid and c5.colno = abs(i.part5)
where constr.constrname = 'your constraint name'
To get the table affected by the constraint "r190_710":
select TABNAME from SYSTABLES where TABID IN
(select TABID from sysconstraints where CONSTRID IN
(select CONSTRID from sysreferences where PTABID IN
(select TABID from sysconstraints where CONSTRNAME= "r190_710" )
)
);
From Informix Guide to SQL: Reference it appears you should look at the system catalog tables SYSCONSTRAINTS and SYSINDICES.
From surfing at www.iiug.org (International Informix Users Group), I found the not-so-easy solution.
(1) Get referential constraint data from the constraint name (you can get all constraints for a table by replacing "AND sc.constrname = ?" by "AND st.tabname MATCHES ?"). This statement selects some more fields than necessary here because they might be interesting in other situations.
SELECT si.part1, si.part2, si.part3, si.part4, si.part5,
si.part6, si.part7, si.part8, si.part9, si.part10,
si.part11, si.part12, si.part13, si.part14, si.part15, si.part16,
st.tabname, rt.tabname as reftable, sr.primary as primconstr,
sr.delrule, sc.constrid, sc.constrname, sc.constrtype,
si.idxname, si.tabid as tabid, rc.tabid as rtabid
FROM 'informix'.systables st, 'informix'.sysconstraints sc,
'informix'.sysindexes si, 'informix'.sysreferences sr,
'informix'.systables rt, 'informix'.sysconstraints rc
WHERE st.tabid = sc.tabid
AND st.tabtype != 'Q'
AND st.tabname NOT MATCHES 'cdr_deltab_[0-9][0-9][0-9][0-9][0-9][0-9]*'
AND rt.tabid = sr.ptabid
AND rc.tabid = sr.ptabid
AND sc.constrid = sr.constrid
AND sc.tabid = si.tabid
AND sc.idxname = si.idxname
AND sc.constrtype = 'R'
AND sc.constrname = ?
AND sr.primary = rc.constrid
ORDER BY si.tabid, sc.constrname
(2) Use part1-part16 to determine which columns are affected by the constraint: each part[n] containing a value different from 0 holds the column number of a used column. Use (3) to find the name of the column.
If constrtype is 'R' (referencing) use the following statement to find the parts of the referencing table:
SELECT part1, part2, part3, part4, part5, part6, part7, part8,
part9, part10, part11, part12, part13, part14, part15, part16
FROM 'informix'.sysindexes si, 'informix'.sysconstraints sc
WHERE si.tabid = sc.tabid
AND si.idxname = sc.idxname
AND sc.constrid = ? -- primconstr from (1)
(3) The tabid and rtabid (for referencing constraints) from (1) can now be used to get the columns of the tables like this:
SELECT colno, colname
FROM 'informix'.syscolumns
WHERE tabid = ? -- tabid(for referenced) or rtabid(for referencing) from (1)
AND colno = ? -- via parts from (1) and (2)
ORDER BY colno
(4) If the constrtype is 'C', then get the check information like this:
SELECT type, seqno, checktext
FROM 'informix'.syschecks
WHERE constrid = ? -- constrid from (1)
Quite hairy indeed
If your constraint is named constraint_c6, here's how to dump its definition (well sort-of, you still need to concatenate the rows, as they'll be separated by whitespace):
OUTPUT TO '/tmp/constraint_c6.sql' WITHOUT HEADINGS
SELECT ch.checktext
FROM syschecks ch, sysconstraints co
WHERE ch.constrid = co.constrid
AND ch.type = 'T' -- text lines only
AND co.constrname = 'constraint_c6'
ORDER BY ch.seqno;