I have to implement a use case where I insert data (the source is a Kafka topic) into multiple Postgres tables (around 10) in a transactional way, i.e. either all table inserts succeed, or if one insert fails, all the inserts for that particular record are rolled back. The records whose inserts failed should also be captured and written to another Kafka topic.
Based on my understanding of the JDBC sink implementation, we can only provide one prepared statement per sink. Also, according to the invoke method signature of the GenericJdbcSinkFunction class:
public void invoke(T value, SinkFunction.Context context) throws IOException
It only throws an IOException. Is it possible to catch the SQLException for a failed insert and write that record to a separate Kafka topic? If so, what happens to the rest of the records in that batch? If one record's insert fails, I believe the whole batch fails.
Is Flink a good fit for such a use case?
I have created a new dataset using the Snowflake connector and used it as the source dataset in a Lookup activity.
Then I am trying to INSERT the record into Snowflake using the following query:
INSERT INTO SAMPLE_TABLE VALUES('TEST',1,1,CURRENT_TIMESTAMP,'TEST'); -- (all values are passed)
Result: the row gets inserted into Snowflake, but my pipeline fails with the error below.
Failure happened on 'Source' side. ErrorCode=UserErrorOdbcInvalidQueryString,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The following ODBC Query is not valid: 'INSERT INTO SAMPLE_TABLE VALUES('TEST',1,1,CURRENT_TIMESTAMP,'TEST');'
Could you please share your advice or any lead to solve this problem?
Thanks.
Rajesh
Lookup, as the name suggests, is for searching and retrieving data, not for inserting. However, you can enclose your INSERT code in a procedure and execute it using the Lookup activity.
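For example, a minimal sketch of that approach using the table from the question (the procedure name is just a placeholder, and this assumes Snowflake Scripting is available on your account):
CREATE OR REPLACE PROCEDURE INSERT_SAMPLE()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
    -- The insert from the question, wrapped so the Lookup activity can CALL it.
    INSERT INTO SAMPLE_TABLE VALUES ('TEST', 1, 1, CURRENT_TIMESTAMP, 'TEST');
    RETURN 'OK';
END;
$$;

-- The Lookup activity would then run:
CALL INSERT_SAMPLE();
The CALL returns a one-row result set, which satisfies the Lookup activity's requirement for an output.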
However, I strongly recommend against doing this. Remember that when inserting data into Snowflake you create at least one micro-partition of around 16 MB, so if you insert one row at a time, performance will be terrible and the data will take up a disproportionate amount of space. Remember that Snowflake is not a transactional (OLTP) database.
Instead, it's better to save all the records in an intermediate file and then import the entire file in one move.
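A rough sketch of that pattern, assuming the records have already been written to a file in a stage (the stage name and file format here are placeholders):
-- Bulk-load the accumulated file in a single operation instead of row-by-row inserts.
COPY INTO SAMPLE_TABLE
FROM @my_stage/sample_rows.csv
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"');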
You can use the Lookup activity to perform operations other than SELECTs; it just HAS to return an output. I've gotten around this with a Postgres database, doing CREATE TABLEs, TRUNCATEs, and one-off INSERTs by just concatenating a
select current_date;
after the main query.
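Applied to the query from the question, the concatenated statement would look something like this (whether the Snowflake ODBC connection accepts multiple statements in one Lookup query is something to verify):
INSERT INTO SAMPLE_TABLE VALUES ('TEST', 1, 1, CURRENT_TIMESTAMP, 'TEST');
select current_date;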
Note: the Script activity will definitely be better for this, but we are waiting on Postgres support for it.
I am reading https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/upsert-kafka/.
It says that:
As a sink, the upsert-kafka connector can consume a changelog stream.
It will write INSERT/UPDATE_AFTER data as normal Kafka messages value,
and write DELETE data as Kafka messages with null values (indicate
tombstone for the key).
It doesn't mention what happens if an UPDATE_BEFORE message is written to upsert-kafka.
In the same link (https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/upsert-kafka/#full-example), the doc provides a full example:
INSERT INTO pageviews_per_region
SELECT
user_region,
COUNT(*),
COUNT(DISTINCT user_id)
FROM pageviews
GROUP BY user_region;
With the above INSERT/SELECT operation, INSERT/UPDATE_BEFORE/UPDATE_AFTER messages will be generated and sent to the upsert-kafka sink. My question is: what happens when the upsert-kafka sink receives an UPDATE_BEFORE message?
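For context, the sink table in that full example is declared roughly as follows (abbreviated here; the bootstrap servers and formats are placeholders):
CREATE TABLE pageviews_per_region (
  user_region STRING,
  pv BIGINT,
  uv BIGINT,
  PRIMARY KEY (user_region) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'pageviews_per_region',
  'properties.bootstrap.servers' = '...',
  'key.format' = 'avro',
  'value.format' = 'avro'
);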
From the comments in the source code:
// partial code
// In upsert mode, during serialization, if the record's RowKind is DELETE or UPDATE_BEFORE,
// the value is set to NULL (corresponding to a Kafka tombstone).
https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=165221669#content/view/165221669
Upsert-kafka sink doesn’t require planner to send UPDATE_BEFORE messages (planner may still send UPDATE_BEFORE messages in some cases), and will write INSERT/UPDATE_AFTER messages as normal Kafka records with key parts, and will write DELETE messages as Kafka records with null values (indicate tombstone for the key). Flink will guarantee the message ordering on the primary key by partition data on the values of the primary key columns.
Upsert-kafka source is a kind of changelog source. The primary key semantics on changelog source means the materialized changelogs (INSERT/UPDATE_BEFORE/UPDATE_AFTER/DELETE) are unique on the primary key constraints. Flink assumes all messages are in order on the primary key.
Implementation Details
The upsert-kafka connector only produces an upsert stream, which doesn't contain UPDATE_BEFORE messages. However, several operations require UPDATE_BEFORE messages for correct processing, e.g. aggregations. Therefore, we need a physical node to materialize the upsert stream and generate a changelog stream with full change messages. In the physical operator, we will use state to know whether the key is being seen for the first time. The operator will produce INSERT rows, or additionally generate UPDATE_BEFORE rows for the previous image, or produce DELETE rows with all columns filled with values.
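For example, aggregating over an upsert-kafka table is the kind of downstream operation that needs this materialization; a hypothetical follow-up query on the example table:
-- Reading pageviews_per_region back as an upsert-kafka source and aggregating it
-- forces the planner to add the materialization (changelog normalization) node first.
SELECT SUM(pv) AS total_pv
FROM pageviews_per_region;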
I am new to Flink and want to understand how to implement my use case with Flink:
The application has three input data sources:
a) historical data
b) live events from Kafka
c) a control event that carries a trigger condition
Since the application deals with historical data, I thought I would merge the historical data and the live data and create a table on that stream.
To trigger the event, we have to build the SQL query from the control event, which is an input source and holds the WHERE clause.
My problem is building the SQL query while the data is in a stream; when I do something like
DataStream<ControlEvent> controlEvent = ...; // control stream from Kafka

controlEvent.flatMap(new FlatMapFunction<ControlEvent, String>() {
    @Override
    public void flatMap(ControlEvent event, Collector<String> coll) {
        // referencing tableEnv inside the function is what throws the serialization exception
        tableEnv.executeSql("select * from tableName");
    }
});
It throws a not-serializable exception for LocalExecutionEnvironment.
That sort of dynamic query injection is not (yet) supported by Flink SQL.
Update:
Given what you've said about your requirements -- that the variations in the queries will be limited -- what you might do instead is to implement this using the DataStream API, rather than SQL. This would probably be a KeyedBroadcastProcessFunction that would hold some keyed state and you could broadcast in the updates to the query/queries.
Take a look at the Fraud Detection Demo as an example of how to build this sort of thing with Flink.
I have a table dbo.RawMessage, which another system frequently inserts data into (about 2 records per second).
I need to process the data in dbo.RawMessage and put the processed data into dbo.ProcessedMessage.
Because the processing logic is not very complicated, my approach was to add an INSERT trigger on the RawMessage table, but sometimes I get deadlocks.
I am using SQL Server Express.
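For illustration, a simplified sketch of what such an AFTER INSERT trigger might look like (the table names are from the question; the column names and processing logic are placeholders):
CREATE TRIGGER dbo.trg_RawMessage_Insert
ON dbo.RawMessage
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Process the newly inserted rows and copy the result into ProcessedMessage.
    INSERT INTO dbo.ProcessedMessage (RawMessageId, ProcessedPayload)
    SELECT i.Id, UPPER(i.Payload)  -- placeholder for the real processing logic
    FROM inserted AS i;
END;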
My questions:
1. Is this a stupid approach?
2. If not, how can I improve it?
3. If yes, please point me to a more graceful way.
I'm using an OLE DB Destination to populate a table with values from a web service.
The package will be scheduled to run in the early AM for the prior day's activity. However, if this fails, the package can be executed manually.
My concern is that if the operator chooses a date range that overlaps existing data, the whole package will fail (verified).
I would like it to:
INSERT the missing values (works as expected if no duplicates)
ignore the duplicates; not cause the package to fail; raise an exception that can be captured by the Windows Application log (logged as a warning)
collect the number of successfully inserted records and the number of duplicates
If it matters, I'm using Data access mode = Table or view - fast load.
Suggestions on how to achieve this are appreciated.
That's not a feature.
If you don't want errors (duplicates), then you need to defend against them, much as you'd do in your favorite language. Instead of relying on error handling, you test for the existence of the error-inducing condition (a Lookup Transform to identify whether the row already exists in the destination) and then filter the duplicates out (Redirect No Match Output).
The technical solution you absolutely should not implement
Change the access mode from "Table or View Name - Fast Load" to "Table or View Name". This changes the insert method from a bulk, set-based operation to singleton inserts. By inserting one row at a time, the SSIS package can evaluate the success or failure of each row's save. You then need to go into the advanced editor (your screenshot) and change the error disposition from Fail Component to Ignore Failure.
This solution should not be used, as it yields poor performance, generates unnecessary workload, and has the potential to mask other save errors beyond just "duplicates" - referential integrity violations, for example.
Here's how I would do it:
1. Point your SSIS Destination to a staging table that will be empty when the package is run.
2. Insert all rows into the staging table.
3. Run a stored procedure that uses SQL to import records from the staging table to the final destination table, WHERE the records don't already exist in the destination table.
4. Collect the desired metadata and do whatever you want with it.
5. Empty the staging table for the next use.
(Those last 3 steps would all be done in the same stored procedure).
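A minimal sketch of such a procedure, assuming a single-column business key and hypothetical table/column names (adjust to your schema):
CREATE PROCEDURE dbo.ImportFromStaging
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @inserted INT, @total INT;
    SELECT @total = COUNT(*) FROM dbo.StagingTable;

    -- Step 3: copy only rows that do not already exist in the destination.
    INSERT INTO dbo.FinalTable (Id, ActivityDate, Amount)
    SELECT s.Id, s.ActivityDate, s.Amount
    FROM dbo.StagingTable AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.FinalTable AS f WHERE f.Id = s.Id);

    SET @inserted = @@ROWCOUNT;

    -- Step 4: report inserted vs. duplicate counts (log these however you prefer).
    SELECT @inserted AS InsertedRows, @total - @inserted AS DuplicateRows;

    -- Step 5: reset the staging table for the next run.
    TRUNCATE TABLE dbo.StagingTable;
END;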