upsert-kafka connector in Flink produces null in the final topic - apache-flink

I have SQL defined this way:
CREATE TABLE raw_table
(
    headers VARCHAR,
    id VARCHAR,
    type VARCHAR,
    contentJson VARCHAR
) WITH (
    'connector' = 'kafka',
    'topic-pattern' = 'role__.+?',
    'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
    'properties.group.id' = 'role_local_1',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'json',
    'properties.allow.auto.create.topics' = 'true',
    'json.timestamp-format.standard' = 'ISO-8601',
    'sink.parallelism' = '3'
);

create view ROLES_NORMALIZED as
(
    select
        JSON_VALUE(contentJson, '$.id') as id,
        rr.type as type
    from raw_table rr
);

CREATE VIEW ROLES_UPSERTS_V1 AS
(
    SELECT *
    FROM ROLES_NORMALIZED
    WHERE type in ('ROLE_CREATED', 'ROLE_UPDATED')
);

CREATE VIEW ROLES_DELETED_V1 AS
(
    SELECT org,
           pod,
           tenantId,
           id,
           modified,
           modified as deleted,
           event_timestamp
    FROM ROLES_NORMALIZED
    WHERE type in ('ROLES_DELETED')
);
-------
CREATE TABLE final_topic
(
    event_timestamp TIMESTAMP_LTZ,
    id VARCHAR,
    name VARCHAR,
    deleted TIMESTAMP_LTZ,
    PRIMARY KEY (pod, org, id) NOT ENFORCED
) WITH (
    'connector' = 'upsert-kafka',
    'topic' = 'final_topic',
    'properties.bootstrap.servers' = 'localhost:29092,localhost:39092',
    'properties.group.id' = 'some-group-id',
    'value.format' = 'json',
    'key.format' = 'json',
    'properties.allow.auto.create.topics' = 'true',
    'properties.replication.factor' = '3',
    'value.json.timestamp-format.standard' = 'ISO-8601',
    'sink.parallelism' = '3'
);

INSERT INTO final_topic
select
    GREATEST(r.event_timestamp, d.event_timestamp) as event_timestamp,
    r.id,
    r.name,
    d.deleted
from ROLES_UPSERTS_V1 r
LEFT JOIN ROLES_DELETED_V1 d
    ON r.id = d.id;
The final_topic produces the result I want to see, which is the join of ROLES_UPSERTS_V1 and ROLES_DELETED_V1.
I tried this by publishing records to topics matching role__.+?.
What I am observing is that final_topic has null values as well. These are emitted whenever the changelog row kind happens to be -D (DELETE). I understand why this exists (it is saying the original message needs to be deleted and a new one will follow), but I don't want such null values in my final_topic, just the desired final state. Is this possible?
The alternative I am trying is the regular Kafka connector, but the join does not seem to work: I get an error saying org.apache.flink.table.api.TableException: Table sink 'default_catalog.default_database.final_topic' doesn't support consuming update and delete changes which is produced by node Join(joinType=[LeftOuterJoin]. I get this error when I use views for ROLES_UPSERTS_V1 and ROLES_DELETED_V1, but if I have these as tables (with the kafka connector), only an inner join works (a left join does not).

If you don't want null values, you can consider buffering records before you flush the results to the Upsert Kafka connector. See https://nightlies.apache.org/flink/flink-docs-stable/docs/connectors/table/upsert-kafka/#sink-buffer-flush-max-rows for more details.
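A minimal sketch of what that looks like, reusing the final_topic definition from the question (both options must be set to values greater than zero for buffering to kick in; the sizes here are illustrative, not tuned):

CREATE TABLE final_topic
(
    -- same columns and primary key as in the question
    ...
) WITH (
    'connector' = 'upsert-kafka',
    -- same options as in the question, plus:
    'sink.buffer-flush.max-rows' = '1000',
    'sink.buffer-flush.interval' = '1 s'
);

With buffering enabled, the sink compacts records that share a primary key before flushing, which reduces the number of tombstone (null) records written to the topic.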
Regarding your join, as outlined in the docs: "For streaming queries, the grammar of regular joins is the most flexible and allow for any kind of updating (insert, update, delete) input table." Since you're running a streaming query, a future change could mean that the result of your join is an update or a delete. However, the sink that you're trying to emit to does not support this, hence the error.
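If you want to see which row kinds a query will emit before wiring it to a sink, newer Flink versions (1.15+) support a changelog-mode detail on EXPLAIN; a sketch against the views above:

EXPLAIN CHANGELOG_MODE
SELECT GREATEST(r.event_timestamp, d.event_timestamp) AS event_timestamp,
       r.id,
       r.name,
       d.deleted
FROM ROLES_UPSERTS_V1 r
LEFT JOIN ROLES_DELETED_V1 d ON r.id = d.id;

Each operator in the printed plan is annotated with its changelog mode (e.g. changelogMode=[I,UB,UA,D]), which tells you up front whether the sink has to be able to consume updates and deletes.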

Related

Flink: Temporal Join not emitting data

I'm trying to implement an event-time temporal join, but I don't see any data being emitted from the join. I don't see any runtime exceptions either.
Flink Version: 1.13
Kafka topics have only 1 partition for now
Here's how I set it up:
I have an "append-only" DataStream (left input/probe side) which looks like the following:
{
    "eventType": String,
    "eventTime": LocalDateTime,
    "eventId": String
}
So, I convert this datastream to a table before joining them:
var eventTable = tableEnv.fromDataStream(eventStream, Schema.newBuilder()
        .column("eventId", DataTypes.STRING())
        .column("eventTime", DataTypes.TIMESTAMP(3))
        .column("eventType", DataTypes.STRING())
        .watermark("eventTime", $("eventTime"))
        .build());
Then, I have the "versioned table" (right input/build side) backed by Kafka (Debezium CDC changelog) which looks like the following:
CREATE TABLE metadata (
    id VARCHAR,
    eventMetadata VARCHAR,
    origin_ts TIMESTAMP(3) METADATA FROM 'value.source.timestamp' VIRTUAL,
    PRIMARY KEY (id) NOT ENFORCED,
    WATERMARK FOR origin_ts AS origin_ts
) WITH (
    'connector' = 'kafka',
    'properties.bootstrap.servers' = 'SERVER_ADDR',
    'properties.group.id' = 'SOME_GROUP',
    'topic' = 'SOME_TOPIC',
    'scan.startup.mode' = 'latest-offset',
    'value.format' = 'debezium-json'
)
The join query looks like this:
SELECT e.eventId, e.eventTime, e.eventType, m.eventMetadata
FROM events_view AS e
JOIN metadata_view FOR SYSTEM_TIME AS OF e.eventTime AS m
ON e.eventId = m.id
Following some other posts on here, I've set the source idle timeout:
table.exec.source.idle-timeout -> 5
And I've also tried setting idleness on the watermark strategy to make sure the source doesn't hold back emitting watermarks. At this point I can see watermarks being generated, but I still don't get any results. Everything just ends up sitting in the temporal join operator.
So, the problem here was the syntax of the processing-time temporal join. Here's how to fix it:
// register the metadata table as a temporal table function, specifying its
// time attribute and primary-key attribute
var metadataHistory = tableEnv.from("metadata")
        .createTemporalTableFunction($("proc_time"), $("id"));
tableEnv.createTemporarySystemFunction("metadata_view", metadataHistory);

// SQL processing-time temporal join
var temporalJoinResult = tableEnv.sqlQuery("SELECT" +
        " e.eventId, e.eventType, e.eventTime, m.eventMetadata" +
        " FROM events_view AS e," +
        " LATERAL TABLE (metadata_view(e.procTime)) AS m" +
        " WHERE e.eventId = m.id");
Here, proc_time on metadata needs to be declared within the table DDL like this,
CREATE TABLE metadata (
    id VARCHAR,
    eventMetadata VARCHAR,
    proc_time AS PROCTIME(),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector' = 'kafka',
    'properties.bootstrap.servers' = 'SERVER_ADDR',
    'properties.group.id' = 'SOME_GROUP',
    'topic' = 'SOME_TOPIC',
    'scan.startup.mode' = 'latest-offset',
    'value.format' = 'debezium-json'
)
and while converting the DataStream to a table, assign procTime for that table as well, like this:
var eventTable = tableEnv.fromDataStream(eventStream, Schema.newBuilder()
        .column("eventId", DataTypes.STRING())
        .column("eventTime", DataTypes.TIMESTAMP(3))
        .column("eventType", DataTypes.STRING())
        .columnByExpression("procTime", "PROCTIME()")
        .build());

get columns other than primary key from CHANGETABLE function sql-server

I have two tables (InvoiceRequests, InvoiceRequestLineItems); InvoiceRequestLineItems has InvoiceRequestId as a foreign key, plus Quantity and Rate columns.
Both tables have change tracking enabled.
I'm writing a query that returns only the InvoiceRequests that have been changed, together with their respective Amount (calculated from the Quantity and Rate columns of InvoiceRequestLineItems).
The scenario: when I update/insert into LineItems, I join the LineItems table and CHANGETABLE to get the InvoiceRequestId, so I can show customers that the InvoiceRequest amount changed.
But when I delete any LineItems, I lose the InvoiceRequestId (foreign key) from that row. Then I can't tell that the amount has changed for that particular InvoiceRequest.
Is there any way the CHANGETABLE function can return other columns (InvoiceRequestId) apart from the primary key (InvoiceRequestLineItemId)?
A solution like adding a trigger and keeping the deleted records in a separate table would add a lot of overhead.
Please suggest anything I can do with minimal changes (one idea is sketched after the query below). Thanks
SELECT
    CTIRL.INVOICEREQUESTSID [InvoiceRequestId],
    IR.STATUS [Status],
    ERIL.[InvoiceRequestAmount],
    CAST(0 AS BIT) [Deleted]
FROM
    (SELECT [IRS].INVOICEREQUESTSID, CTL.SYS_CHANGE_COLUMNS, CTL.SYS_CHANGE_OPERATION
     FROM INVOICEREQUESTLINEITEMS [IRS]
     JOIN CHANGETABLE(CHANGES dbo.INVOICEREQUESTLINEITEMS, #ctversion) CTL
         ON CTL.INVOICEREQUESTLINEITEMSID = [IRS].INVOICEREQUESTLINEITEMSID
    ) AS CTIRL
LEFT JOIN dbo.INVOICEREQUESTS IR
    ON CTIRL.INVOICEREQUESTSID = IR.INVOICEREQUESTSID
LEFT JOIN
    (SELECT
         IRLI.INVOICEREQUESTSID,
         SUM(IRLI.QUANTITY * IRLI.RATE + COALESCE(IRLI.TAXAMOUNT, 0)) [InvoiceRequestAmount]
     FROM INVOICEREQUESTLINEITEMS IRLI
     GROUP BY IRLI.INVOICEREQUESTSID
    ) AS ERIL
    ON ERIL.INVOICEREQUESTSID = CTIRL.INVOICEREQUESTSID
WHERE
    (
        CHANGE_TRACKING_IS_COLUMN_IN_MASK(5, CTIRL.SYS_CHANGE_COLUMNS) = 1
        OR CHANGE_TRACKING_IS_COLUMN_IN_MASK(7, CTIRL.SYS_CHANGE_COLUMNS) = 1
        OR CHANGE_TRACKING_IS_COLUMN_IN_MASK(11, CTIRL.SYS_CHANGE_COLUMNS) = 1
        OR CTIRL.SYS_CHANGE_OPERATION = 'D'
        OR CTIRL.SYS_CHANGE_OPERATION = 'I'
    )
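One minimal-change idea, not from the original thread: change tracking never stores column values, so once the row is gone the foreign key has to be captured at delete time. An OUTPUT clause on the DELETE statement can do that without a trigger (the side table and @lineItemId variable below are hypothetical names):

-- Hypothetical side table holding the FK of deleted line items
CREATE TABLE dbo.DELETEDLINEITEMKEYS
(
    INVOICEREQUESTLINEITEMSID INT NOT NULL,
    INVOICEREQUESTSID INT NOT NULL,
    DELETEDAT DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);

DELETE IRLI
OUTPUT deleted.INVOICEREQUESTLINEITEMSID, deleted.INVOICEREQUESTSID
    INTO dbo.DELETEDLINEITEMKEYS (INVOICEREQUESTLINEITEMSID, INVOICEREQUESTSID)
FROM dbo.INVOICEREQUESTLINEITEMS IRLI
WHERE IRLI.INVOICEREQUESTLINEITEMSID = @lineItemId;

The query above can then LEFT JOIN this side table to recover the InvoiceRequestId for rows CHANGETABLE reports with SYS_CHANGE_OPERATION = 'D'.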

SQL Server Cannot Update Table with Subqueries

I'm trying to update a temporary table called #deletedRecords with the data from a table called Log (screenshots of both tables omitted).
The KeyValue in the Log table is the same as the ID in #deletedRecords.
There is a column in #deletedRecords for every FieldName for any particular key value.
I tried to extract the values using the following query:
UPDATE #deletedRecords
SET PatientName = (SELECT ACL.OldValue WHERE ACL.FieldName = 'CptCode'),
ChargeNotes = (SELECT ACL.OldValue WHERE ACL.FieldName = 'ChargeNotes'),
Units = (SELECT ACL.OldValue WHERE ACL.FieldName = 'Units'),
ChargeStatusID = (SELECT ACL.OldValue WHERE ACL.FieldName = 'Units')
FROM Log ACL
JOIN #deletedRecords DR ON ACL.KeyValue = DR.ID
WHERE ACL.TableName = 'BillingCharge'
AND ACL.EventType = 'DELETE'
However, when I run the query, all of the columns to be updated in #deletedRecords are NULL. Can somebody please explain what I'm missing?
Thanks in advance.
EDIT:
In response to @Yogesh Sharma's answer, I elected to use the CTE method. I would like to use the values from the CTE to join to additional tables and extract their values during the update.
e.g., the Log table doesn't contain an old value for StatusName, but it does contain the ChargeStatusID, which can be used to join to another table that holds that information, such as this ChargeStatus table (screenshot omitted):
Thus I modified @Yogesh Sharma's code to the following:
WITH cte AS
...
UPDATE d
SET d.PatientName = c.PatientName
, d.StatusName = cs.StatusName
FROM #deletedBillingChargeTemp d
JOIN cte c ON c.KeyValue = d.chargeID
JOIN ChargeStatus cs ON c.ChargeStatusID = cs.ChargeStatusID
However, once I add that secondary join, all of the updated values return to NULL, as they were before @Yogesh Sharma's suggestions were implemented.
Your query does not work because the UPDATE is executed multiple times for each row in DR, considering only the conditions specified in the last three lines of your query (not the ones specified in the subqueries). The values that remain in the table are the ones from the ACL row used in the last execution (and the order of execution cannot be controlled). If, for the ACL row used in the last execution, the subqueries return NULL, you get a NULL result.
See the example in the https://learn.microsoft.com/en-us/sql/t-sql/queries/update-transact-sql topic, where it says "The results of an UPDATE statement are undefined if the statement includes a FROM clause that is not specified in such a way that only one value is available for each column occurrence that is updated, that is if the UPDATE statement is not deterministic.".
You should rewrite your query like this:
UPDATE #deletedRecords
SET PatientName = (
SELECT ACL.OldValue FROM Log ACL
WHERE ACL.FieldName = 'CptCode' AND ACL.KeyValue = DR.ID
AND ACL.TableName = 'BillingCharge' AND ACL.EventType = 'DELETE'
),
ChargeNotes = (
SELECT ACL.OldValue FROM Log ACL
WHERE ACL.FieldName = 'ChargeNotes' AND ACL.KeyValue = DR.ID
AND ACL.TableName = 'BillingCharge' AND ACL.EventType = 'DELETE'
),
Units = (
SELECT ACL.OldValue FROM Log ACL
WHERE ACL.FieldName = 'Units' AND ACL.KeyValue = DR.ID
AND ACL.TableName = 'BillingCharge' AND ACL.EventType = 'DELETE'
),
ChargeStatusID = (
SELECT ACL.OldValue FROM Log ACL
WHERE ACL.FieldName = 'Units' AND ACL.KeyValue = DR.ID
AND ACL.TableName = 'BillingCharge' AND ACL.EventType = 'DELETE'
)
FROM #deletedRecords DR
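One caveat worth adding here (my note, not from the original answer): each of these scalar subqueries must return at most one row per DR.ID, otherwise the statement fails with "Subquery returned more than 1 value". If Log can hold several matching rows, constrain each subquery, for example:

SET PatientName = (
    SELECT TOP 1 ACL.OldValue FROM Log ACL
    WHERE ACL.FieldName = 'CptCode' AND ACL.KeyValue = DR.ID
        AND ACL.TableName = 'BillingCharge' AND ACL.EventType = 'DELETE'
    ORDER BY ACL.LogDate DESC  -- LogDate is a hypothetical ordering column
)

The answer below uses TOP 1 for the same reason.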
You would need to do some conditional aggregation on the Log table and then JOIN in order to update the records in the temporary table #deletedRecords.
The conditional approach can be achieved via a CTE or a subquery:
WITH cte AS
(
SELECT KeyValue,
MAX(CASE WHEN FieldName = 'CptCode' THEN OldValue END) PatientName,
MAX(CASE WHEN FieldName = 'ChargeNotes' THEN OldValue END) ChargeNotes,
...
FROM Log
WHERE TableName = 'BillingCharge' AND EventType = 'DELETE'
GROUP BY KeyValue
)
UPDATE d
SET d.PatientName = c.PatientName,
...
FROM #deletedRecords d
INNER JOIN cte c ON c.KeyValue = d.ID
The other way is to update your temporary table via a correlated-subquery approach:
UPDATE d
SET d.PatientName = (SELECT TOP 1 OldValue FROM Log WHERE KeyValue = d.ID AND
TableName = 'BillingCharge' AND EventType = 'DELETE' AND FieldName = 'CptCode'),
d.ChargeNotes= (SELECT TOP 1 OldValue FROM Log WHERE KeyValue = d.ID AND
TableName = 'BillingCharge' AND EventType = 'DELETE' AND FieldName = 'ChargeNotes'),
...
FROM #deletedRecords d
If your updated columns are NULL, these are the possible causes:
Since you are doing an INNER JOIN, records might not be joining correctly on the joining column. Make sure both tables have matching values in the joining columns.
Since you are filtering in a WHERE clause, records might not fulfill your TableName and EventType filters. Make sure there are records that successfully join and that have the supplied TableName and EventType.
The values you are assigning are NULL. Make sure your subqueries return a non-NULL value.
The table reference is off. When updating a table in SQL Server, always use the updated table's alias if you are using one.
Use

UPDATE DR SET
    YourColumn = Value
FROM
    Log ACL
    JOIN #deletedRecords DR ON ...

instead of

UPDATE #deletedRecords SET
    YourColumn = Value
FROM
    Log ACL
    JOIN #deletedRecords DR ON ...
Make sure you are NOT checking the table variable's values in another batch, script, or procedure. Table variables are scoped to the current batch or procedure, while temporary tables remain as long as the session is alive.
Make sure that there isn't another statement setting those values to NULL after your update. Also keep an eye on your transactions (they might not be committed, or might be rolled back).
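When hunting down these causes, a quick check (a sketch using the names from the question) is to run the UPDATE's FROM/JOIN/WHERE as a plain SELECT and inspect what the update would actually see:

SELECT DR.ID, ACL.FieldName, ACL.OldValue
FROM Log ACL
JOIN #deletedRecords DR ON ACL.KeyValue = DR.ID
WHERE ACL.TableName = 'BillingCharge'
    AND ACL.EventType = 'DELETE';

No rows means the join or the filters are the problem; rows with a NULL OldValue mean the source data itself is NULL.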

TSQL: Trying to turn joined select statement into update statement

Not the most complex problem, but I've been looking at this one for a while and I'm at a block. Here is a T-SQL select statement I want to turn into an update statement. The table structures are included for reference (server, db, and table names simplified from what they actually are).
SELECT m.ConfirmationCode AS CodeInMain,
       e.ConfirmationCode AS CodeInEvents
FROM Server1.DB1.dbo.MainTable AS m
INNER JOIN Server2.DB1.dbo.Events AS e
    ON m.AccountCode = e.AccountCode
    AND m.Commodity = (
        SELECT Alias
        FROM Server2.DB1.dbo.Commodities
        WHERE pkCommodityId = e.fkCommodityId)
WHERE m.AccountCode = e.AccountCode
    AND m.Brand IN ('FTR', 'CER');
Here are the referenced table schemas:
Server1.DB1.dbo.MainTable (ConfirmationCode varchar(100), AccountCode varchar(100), Commodity varchar(1), Brand varchar(3))
Server2.DB1.dbo.Events (ConfirmationCode varchar(100), AccountCode varchar(100), Commodity int)
Server2.DB1.dbo.Commodities (pkCommodityId int, Alias varchar(100))
Server 2 is linked to Server 1.
The current output of the select statement is:
CodeInMain | CodeInEvents
--------------------------
AN235cAc0a | NULL
CSORSX239c | NULL
...
All of my outputted information is as expected.
My goal is to update e.ConfirmationCode (CodeInEvents) with the data in m.ConfirmationCode (CodeInMain), but I am getting stuck on how to accommodate the join. I realize that a cursor could cycle through the contents of the above output stored in a temporary table, but I would really like to do this in a single update statement.
SQL Server has a non-ANSI extended UPDATE syntax that supports JOIN:
UPDATE e
SET e.ConfirmationCode = m.ConfirmationCode
FROM Server1.DB1.dbo.MainTable AS m
INNER JOIN Server2.DB1.dbo.Events AS e
    ON m.AccountCode = e.AccountCode
    AND m.Commodity = (
        SELECT Alias
        FROM Server2.DB1.dbo.Commodities
        WHERE pkCommodityId = e.fkCommodityId)
WHERE m.AccountCode = e.AccountCode
    AND m.Brand IN ('FTR', 'CER');
Got it:
UPDATE e
SET e.ConfirmationCode = m.ConfirmationCode
FROM Server2.DB1.dbo.Events e
INNER JOIN Server1.DB1.dbo.MainTable m
    ON m.AccountCode = e.AccountCode
    AND m.Commodity = (SELECT Alias FROM Server2.DB1.dbo.Commodities
                       WHERE pkCommodityId = e.fkCommodityId)
WHERE m.AccountCode = e.AccountCode
    AND m.Brand IN ('FTR', 'CER')
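If you want to sanity-check the update before it sticks, one pattern (my sketch, not from the thread; note that updating across a linked server inside an explicit transaction may require the Distributed Transaction Coordinator) is to run it inside a transaction and inspect the result first:

BEGIN TRAN;

UPDATE e
SET e.ConfirmationCode = m.ConfirmationCode
FROM Server2.DB1.dbo.Events e
INNER JOIN Server1.DB1.dbo.MainTable m
    ON m.AccountCode = e.AccountCode
    AND m.Commodity = (SELECT Alias FROM Server2.DB1.dbo.Commodities
                       WHERE pkCommodityId = e.fkCommodityId)
WHERE m.Brand IN ('FTR', 'CER');

-- inspect the affected rows, e.g.:
SELECT TOP 100 e.AccountCode, e.ConfirmationCode FROM Server2.DB1.dbo.Events e;

ROLLBACK; -- or COMMIT once the rows look right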

SQL Server adding its own aliases and getting them wrong

I wrote the following short, simple view:
--DROP VIEW ReportObjects.vStarts;
CREATE VIEW ReportObjects.vStarts AS
SELECT *
FROM ReportObjects.vLotHistory
WHERE
LotType = 'Some constant value'
AND CDOName = 'Some constant value'
AND SpecId = 'Some constant value'
When I execute this script then click "Design" on the view in SSMS, I get this:
SELECT ContainerName, SpecName, ProductName, LotStatus, CDOName, ResourceName AS LotType, EmployeeName AS WorkflowName, LotType AS DieQty,
WorkflowName AS LotQty, DieQty AS TxnDate, LotQty AS TxnDateGMT, TxnDate AS ContainerId, TxnDateGMT AS HistoryMainlineId, ContainerId AS ProductId,
HistoryMainlineId AS SpecId, ProductId AS WorkflowId, SpecId AS WorkflowBaseId, WorkflowId AS WorkflowStepId, WorkflowBaseId, WorkflowStepId, ResourceId,
EmployeeId
FROM ReportObjects.vLotHistory
WHERE (LotType = 'Some constant value') AND (CDOName = 'Some constant value') AND (SpecId = 'Some constant value')
See the problem? DieQty AS TxnDate, LotQty AS TxnDateGMT, TxnDate AS ContainerId
It's aliasing columnA with the name of columnC!
I've tried dropping/recreating the view several times.
I know it can be argued that SELECT * is icky, but that's beside the point (and it is necessary sometimes in production code).
The view above is selecting from another view defined as:
CREATE VIEW ReportObjects.vLotHistory AS
SELECT
    lot.ContainerName,
    hist.SpecName,
    hist.ProductName,
    lot.Status LotStatus,
    hist.CDOName,
    hist.ResourceName,
    hist.EmployeeName,
    lot.csiLotType LotType,
    WorkflowBase.WorkflowName,
    hist.Qty DieQty,
    hist.Qty2 LotQty,
    hist.TxnDate,
    hist.TxnDateGMT,
    lot.ContainerId,
    hist.HistoryMainlineId,
    hist.ProductId,
    hist.SpecId,
    Workflow.WorkflowId,
    Workflow.WorkflowBaseId,
    hist.WorkflowStepId,
    hist.ResourceId,
    hist.EmployeeId
FROM ODS.MES_Schema.Container lot
LEFT JOIN ODS.MES_Schema.HistoryMainline hist
    ON lot.ContainerId = hist.ContainerId
LEFT JOIN ODS.MES_Schema.WorkflowStep WS
    ON hist.WorkflowStepId = WS.WorkflowStepId
LEFT JOIN ODS.MES_Schema.Workflow
    ON WS.WorkflowId = Workflow.WorkflowId
LEFT JOIN ODS.MES_Schema.WorkflowBase
    ON WorkflowBase.WorkflowBaseId = Workflow.WorkflowBaseId
Does anyone know why SSMS or SQL Server is eating my query and spitting out nonsense? Note that I am not using SSMS tools to create the view -- I am using CREATE VIEW as shown above.
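One thing worth checking (an assumption on my part, not confirmed in this thread): SQL Server binds a view's column metadata when the view is created, so if ReportObjects.vLotHistory was altered after vStarts was created, the SELECT * expansion stored for vStarts can point at a stale, shifted column list, which would produce exactly this kind of off-by-N aliasing. Refreshing the dependent view's metadata is cheap and safe:

-- Re-bind vStarts' column metadata against the current definition of vLotHistory
EXEC sp_refreshview 'ReportObjects.vStarts';

If the Design output looks correct afterwards, the cause was stale metadata from the underlying view changing after vStarts was created.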
