Flink SQL-Cli: Hudi is abstract - apache-flink

I'm trying to recreate the common Flink example working with Hudi (https://hudi.apache.org/docs/flink-quick-start-guide), but when I try to insert the example data an error appears. Can someone help me with this?
The steps that I'm following in my AWS EMR cluster are:
export JVM_ARGS=-Djava.io.tmpdir=/mnt/tmp
sudo aws s3 cp MyBucketLocation/hudi-flink-bundle_2.11-0.10.0.jar /lib/flink/lib/hudi-flink-bundle_2.11-0.10.0.jar
# Init the Flink SQL CLI
/usr/lib/flink/bin/sql-client.sh
--Create table
CREATE TABLE t1(
uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
name VARCHAR(10),
age INT,
ts TIMESTAMP(3),
`partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
'connector' = 'hudi',
'path' = 's3://issue-lmdl-s3-ldz/msk/Flink/kafka/',
'table.type' = 'MERGE_ON_READ' -- this creates a MERGE_ON_READ table; the default is COPY_ON_WRITE
);
-- Insert the sample rows from the documentation
INSERT INTO t1 VALUES
('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
I'm working with EMR 6.8.0, and the Flink SQL CLI has already worked with Kafka; I just want to write these records in Hudi format.

Related

Flink SQL - Sinking records in a particular order

My team is using Flink SQL to build some of our pipelines.
For simplicity, let's assume that there are two independent pipelines:
the first one listens to events in input_stream, stores enriched data in feature_jdbc_table, and then emits a feature_updates event signalling that the feature was updated;
the second one listens to feature_updates and, when certain events come in, fetches data from feature_jdbc_table to do calculations.
Here are the source/sink definitions:
CREATE TABLE input_stream (
user_id STRING,
event_time TIMESTAMP(3) METADATA FROM 'timestamp',
WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
-- ...
)
-- NEXT SQL --
CREATE TABLE feature_updates (
event_name STRING
) WITH (
'connector' = 'kafka',
-- ...
)
-- NEXT SQL --
CREATE TABLE feature_jdbc_table (
user_id STRING,
checked_at TIMESTAMP(3)
) WITH (
'connector' = 'jdbc',
-- ...
)
And here is how the sink statements look:
INSERT INTO feature_jdbc_table
SELECT user_id, event_time
FROM input_stream
-- NEXT SQL --
INSERT INTO feature_updates
SELECT 'my_feature_was_updated'
FROM input_stream
The issue is that we often run into race conditions where the feature_updates event is created before the feature_jdbc_table record is committed.
Is there a way to emit feature_updates events only after the feature_jdbc_table records are committed?
I poked around with the idea of reading data from feature_jdbc_table when emitting feature_updates, but it seems that would require configuring CDC for the database; otherwise Flink treats feature_jdbc_table as a "batch" source, not a streaming one. And configuring CDC seems to be very involved.
The following code illustrates what I've tried to do:
INSERT INTO feature_jdbc_table
SELECT user_id, event_time
FROM input_stream
-- NEXT SQL --
INSERT INTO feature_updates
SELECT 'my_feature_was_updated'
FROM feature_jdbc_table
I also tried to introduce a fixed delay (not great, but at least something) with tumble windows; a sketch of that workaround is shown below.
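For reference, here is a minimal sketch of what that tumble-window attempt could look like, assuming Flink 1.13+ windowing TVFs; the 10-second window size and the grouping keys are just illustrative, not part of the original pipelines:
INSERT INTO feature_updates
SELECT 'my_feature_was_updated'
FROM TABLE(
  TUMBLE(TABLE input_stream, DESCRIPTOR(event_time), INTERVAL '10' SECOND))
GROUP BY user_id, window_start, window_end;
-- Rows are only emitted once the watermark passes the window end, which adds a
-- bounded delay but still does not guarantee that the JDBC sink has committed
-- the corresponding feature_jdbc_table rows.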

Perform SCD2 on snowflake table based upon oracle input data

I am currently sourcing data from Oracle.
As part of the initial load, I ingested all history data from the Oracle table oracle_a into the Snowflake table "snow_a" using a named stage and COPY INTO commands.
I would like to perform SCD2 on the snow_a table based upon the oracle_a table.
That is, if a new record is added to the oracle_a table, that record should be inserted; and if an existing record of oracle_a changes, the existing record of snow_a should be expired and the new version inserted.
The oracle_a table has key columns key_col1, key_col2, key_col3; attr1 and attr2 are the other attributes of the table.
Implementing SCD Type 2 functionality on a table in Snowflake is no different than in any other relational database. However, there is additional functionality that can help with this process. Please have a look at this blog post series on using Snowflake Streams and Tasks to perform the SCD logic.
https://www.snowflake.com/blog/building-a-type-2-slowly-changing-dimension-in-snowflake-using-streams-and-tasks-part-1/
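As a rough illustration of that approach, here is a minimal sketch, assuming a staging table stg.oracle_a that the Oracle extracts land in, a warehouse etl_wh, and a stored procedure run_scd2_for_snow_a that holds the expire/insert logic (all three names are hypothetical):
-- capture changes landing in the staging table
CREATE OR REPLACE STREAM stg.oracle_a_stream ON TABLE stg.oracle_a;
-- run the SCD2 logic only when the stream has new rows
CREATE OR REPLACE TASK scd2_snow_a_task
  WAREHOUSE = etl_wh
  SCHEDULE = '15 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('STG.ORACLE_A_STREAM')
AS
  CALL run_scd2_for_snow_a();
ALTER TASK scd2_snow_a_task RESUME;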
Cheers,
Michael Rainey
OK, so here is what I found, though you may need to adjust where the update and insert come from, since oracle_a is not in Snowflake.
CREATE TABLE snowflake_a(key_col1 varchar(10), key_col2 varchar(10), key_col3 varchar(10), attr1 varchar(8), attr2 varchar(10), eff_ts TIMESTAMP, exp_ts TIMESTAMP, valid varchar(10));
DROP table oracle_a;
INSERT INTO snowflake_a VALUES('PT_1', 'DL_1', 'RPT_1', 'Address1', 'APT_1', current_date, current_date, 'Active');
CREATE TABLE oracle_a(key_col1 varchar(10), key_col2 varchar(10), key_col3 varchar(10), attr1 varchar(8), attr2 varchar(8), eff_ts TIMESTAMP, exp_ts TIMESTAMP);
INSERT INTO oracle_a
VALUES( 'PT_1', 'DL_1', 'RPT_1', 'Address1', 'APT_1', '10/24/2019', '12/31/1999');
UPDATE snowflake_a
SET valid = 'Expired'
WHERE valid LIKE '%Active%';
SELECT * FROM snowflake_a;
INSERT INTO snowflake_a VALUES( 'PT_1', 'DL_1', 'RPT_1', 'Address1', 'APT_1', '10/24/2019', '12/31/1999', 'Active');
SELECT * FROM snowflake_a;
Or better yet, what are you using to connect from your Oracle ecosystem to the Snowflake ecosystem?
From the question, it seems that the incoming Oracle rows do not contain any SCD2-type columns and that each row inserted into Snowflake is to be inserted using SCD2-type functionality.
SCD2 columns can have a specific meaning to the business, such that 'exp_ts' could be an actual date or a business date. Snowflake 'Stage' does not include SCD2 functionality; this is usually the role of an ETL framework, not that of a 'fast/bulk' load utility.
Most ETL vendors have SCD2 functions as a part of their offering.
I did the following steps to perform SCD2 (a sketch is shown after the steps):
1. Loaded the oracle_a table data into a TEMPORARY scd2_temp table.
2. Performed an update on snow_a to expire "changed records" by joining on the key columns and checking the rest of the attributes.
3. Inserted into the snow_a table from the TEMPORARY scd2_temp table.
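A minimal sketch of those three steps, assuming snow_a has the layout shown in the earlier example (key_col1..key_col3, attr1, attr2, eff_ts, exp_ts, valid) and that only the key and attribute columns arrive from Oracle:
-- 1: land the incoming Oracle extract in a temporary table
CREATE TEMPORARY TABLE scd2_temp (
  key_col1 varchar(10), key_col2 varchar(10), key_col3 varchar(10),
  attr1 varchar(8), attr2 varchar(10));
-- (populate it with COPY INTO scd2_temp FROM @named_stage ...)
-- 2: expire the currently active rows whose attributes changed
UPDATE snow_a t
SET t.valid = 'Expired',
    t.exp_ts = CURRENT_TIMESTAMP()
FROM scd2_temp s
WHERE t.key_col1 = s.key_col1
  AND t.key_col2 = s.key_col2
  AND t.key_col3 = s.key_col3
  AND t.valid = 'Active'
  AND (t.attr1 <> s.attr1 OR t.attr2 <> s.attr2);
-- 3: insert new keys and new versions of changed keys as active rows
--    (after step 2 the changed keys no longer have an active row)
INSERT INTO snow_a
SELECT s.key_col1, s.key_col2, s.key_col3, s.attr1, s.attr2,
       CURRENT_TIMESTAMP(), CAST('9999-12-31 23:59:59' AS TIMESTAMP), 'Active'
FROM scd2_temp s
LEFT JOIN snow_a t
  ON  t.key_col1 = s.key_col1
  AND t.key_col2 = s.key_col2
  AND t.key_col3 = s.key_col3
  AND t.valid = 'Active'
WHERE t.key_col1 IS NULL;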
Here's a solution based on the following assumptions:
The source Oracle table is not itself responsible for SCD2 processing (so eff/exp TS columns wouldn't be present on that table).
There is an external process that is only extracting/loading delta (new, updated) records into Snowflake.
The source Oracle table is not deleting records.
First create the tables and add the first set of delta data:
CREATE OR REPLACE TABLE stg.cdc2_oracle_d (
key1 varchar(10),
key2 varchar(10),
key3 varchar(10),
attr1 varchar(8),
attr2 varchar(8));
CREATE OR REPLACE TABLE edw.cdc2_snowflake_d (
key1 varchar(10),
key2 varchar(10),
key3 varchar(10),
attr1 varchar(8),
attr2 varchar(8),
eff_ts TIMESTAMP_LTZ(0),
exp_ts TIMESTAMP_LTZ(0),
active_fl char(1));
INSERT INTO stg.cdc2_oracle_d VALUES
( 'PT_1', 'DL_1', 'RPT_1', 'Addr1a', 'APT_1.0'),
( 'PT_2', 'DL_2', 'RPT_2', 'Addr2a', 'APT_2.0'),
( 'PT_3', 'DL_3', 'RPT_3', 'Addr3a', 'APT_3.0');
Then run the following Transformation script:
BEGIN;
-- 1: insert new-new records from the stg table that don't currently exist in the edw table
INSERT INTO edw.cdc2_snowflake_d
SELECT
key1,
key2,
key3,
attr1,
attr2,
CURRENT_TIMESTAMP(0) AS eff_ts,
CAST('9999-12-31 23:59:59' AS TIMESTAMP) AS exp_ts,
'Y' AS active_fl
FROM stg.cdc2_oracle_d stg
WHERE NOT EXISTS (
SELECT 1
FROM edw.cdc2_snowflake_d edw
WHERE edw.key1 = stg.key1
AND edw.key2 = stg.key2
AND edw.key3 = stg.key3
AND edw.active_fl = 'Y');
-- 2: insert a new version of the record from the stg table where the key currently exists in the edw table,
-- but only add it if the attr columns are different, otherwise it's the same record
INSERT INTO edw.cdc2_snowflake_d
SELECT
stg.key1,
stg.key2,
stg.key3,
stg.attr1,
stg.attr2,
CURRENT_TIMESTAMP(0) AS eff_ts,
CAST('9999-12-31 23:59:59' AS TIMESTAMP) AS exp_ts,
'T' AS active_fl -- set flag to a temporary setting
FROM stg.cdc2_oracle_d stg
JOIN edw.cdc2_snowflake_d edw ON edw.key1 = stg.key1 AND edw.key2 = stg.key2
AND edw.key3 = stg.key3 AND edw.active_fl = 'Y'
WHERE (stg.attr1 <> edw.attr1
OR stg.attr2 <> edw.attr2);
-- 3: deactivate the current record where there is a new record from the step above
-- and set exp_ts to 1 second prior to the new record so there is no overlap in data
UPDATE edw.cdc2_snowflake_d old
SET old.active_fl = 'N',
old.exp_ts = DATEADD(SECOND, -1, new.eff_ts)
FROM edw.cdc2_snowflake_d new
WHERE old.key1 = new.key1
AND old.key2 = new.key2
AND old.key3 = new.key3
AND new.active_fl = 'T'
AND old.active_fl = 'Y';
-- 4: finally set all the temporary records to active
UPDATE edw.cdc2_snowflake_d tmp
SET tmp.active_fl = 'Y'
WHERE tmp.active_fl = 'T';
COMMIT;
Review the results, then truncate & add new data and run the script again:
SELECT * FROM stg.cdc2_oracle_d;
SELECT * FROM edw.cdc2_snowflake_d ORDER BY 1,2,3,5;
TRUNCATE TABLE stg.cdc2_oracle_d;
INSERT INTO stg.cdc2_oracle_d VALUES
( 'PT_1', 'DL_1', 'RPT_1', 'Addr1a', 'APT_1.1'), -- record has updated attr2
( 'PT_2', 'DL_2', 'RPT_2', 'Addr2a', 'APT_2.0'), -- record has no changes
( 'PT_4', 'DL_4', 'RPT_4', 'Addr4a', 'APT_4.0'); -- new record
You'll see that PT_1 has two records with non-overlapping timestamps, and only one is active.

Joining streaming data in Apache Spark

Apologies if the title is too vague, but I had trouble phrasing it properly.
So basically I'm trying to figure out whether Apache Spark, together with Apache Kafka, is able to sync data from my relational database to Elasticsearch.
My plan is to use one of the Kafka connectors to read data from the RDBMS and push it into Kafka topics. Here is the DDL of the model. It's quite basic: Report and Product tables with a many-to-many relationship through the ReportProduct table:
CREATE TABLE dbo.Report (
ReportID INT NOT NULL PRIMARY KEY,
Title NVARCHAR(500) NOT NULL,
PublishedOn DATETIME2 NOT NULL);
CREATE TABLE dbo.Product (
ProductID INT NOT NULL PRIMARY KEY,
ProductName NVARCHAR(100) NOT NULL);
CREATE TABLE dbo.ReportProduct (
ReportID INT NOT NULL,
ProductID INT NOT NULL,
PRIMARY KEY (ReportID, ProductID),
FOREIGN KEY (ReportID) REFERENCES dbo.Report (ReportID),
FOREIGN KEY (ProductID) REFERENCES dbo.Product (ProductID));
INSERT INTO dbo.Report (ReportID, Title, PublishedOn)
VALUES (1, N'Yet Another Apache Spark StackOverflow question', '2017-09-12T19:15:28');
INSERT INTO dbo.Product (ProductID, ProductName)
VALUES (1, N'Apache'), (2, N'Spark'), (3, N'StackOverflow'), (4, N'Random product');
INSERT INTO dbo.ReportProduct (ReportID, ProductID)
VALUES (1, 1), (1, 2), (1, 3), (1, 4);
SELECT *
FROM dbo.Report AS R
INNER JOIN dbo.ReportProduct AS RP
ON RP.ReportID = R.ReportID
INNER JOIN dbo.Product AS P
ON P.ProductID = RP.ProductID;
My goal is to transform this into a document with the following structure:
{
  "ReportID": 1,
  "Title": "Yet Another Apache Spark StackOverflow question",
  "PublishedOn": "2017-09-12T19:15:28+00:00",
  "Product": [
    { "ProductID": 1, "ProductName": "Apache" },
    { "ProductID": 2, "ProductName": "Spark" },
    { "ProductID": 3, "ProductName": "StackOverflow" },
    { "ProductID": 4, "ProductName": "Random product" }
  ]
}
I was able to form this kind of structure using static data that I mocked up locally:
import org.apache.spark.sql.functions.{collect_list, struct}

report.join(
  report_product.join(product, "product_id")
    .groupBy("report_id")
    .agg(
      collect_list(struct("product_id", "product_name")).alias("product")
    ), "report_id")
  .show()
But I realize that this is too basic and streams are going to be way more complicated.
Data changes irregularly: reports and their products are constantly being changed, while products change once in a while (mostly on a weekly basis).
I would like to replicate into Elasticsearch any changes that happen in any of these tables.
Use Kafka Connect to pull the data from your source DB: you can use the JDBC Source connector, which is available as part of Confluent Platform (or separately), and you may also want to investigate kafka-connect-cdc-mssql.
Once you've got the data in Kafka, use either the Kafka Streams API to manipulate the data as desired, or look at the newly released KSQL. Which you choose will be driven by things like your preference for coding in Java (with Kafka Streams) or manipulating data in a SQL-like environment (with KSQL). Regardless, the output of both of these is going to be another Kafka topic.
Finally, stream that Kafka topic into Elasticsearch using the Elasticsearch Kafka Connect plugin (available here, or as part of the Confluent Platform).
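To make the KSQL option concrete, here is a minimal sketch of the enrichment step in ksqlDB; the topic names, the JSON format, and the assumption that the product topic is keyed by ProductID are all illustrative, and the nesting into an array of products would then be an aggregation over this enriched stream, analogous to the Spark snippet above:
CREATE STREAM report_product_stream (ReportID INT, ProductID INT)
  WITH (KAFKA_TOPIC = 'dbo.ReportProduct', VALUE_FORMAT = 'JSON');
CREATE TABLE product_table (ProductID INT PRIMARY KEY, ProductName STRING)
  WITH (KAFKA_TOPIC = 'dbo.Product', VALUE_FORMAT = 'JSON');
-- stream-table join: each ReportProduct row picks up the current product name
CREATE STREAM report_product_enriched AS
  SELECT rp.ReportID, rp.ProductID, p.ProductName
  FROM report_product_stream rp
  JOIN product_table p ON rp.ProductID = p.ProductID
  EMIT CHANGES;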

Execute a Postgres trigger 24 hours after record insertion?

I'm trying to execute a trigger 24 hours after record insertion, for each row. How can I do this? Please help.
The idea is that if a user doesn't verify their email, their account will be deleted.
Without cron and so on.
The database is PostgreSQL.
This may help in relation to scheduling an SQL command to run at a certain time:
pgAgent
This is the command I tested, and you could use it in pgAgent.
delete from email_tbl where email_id in(select email_id from email_tbl
where timestamp < now() - '1 day'::interval );
Here is the test data I used:
CREATE EXTENSION citext;
CREATE DOMAIN email_addr AS citext
CHECK(
VALUE ~ '^[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+[.][A-Za-z]+$'
);
CREATE TABLE email_tbl (
email_id SERIAL PRIMARY KEY,
email_addr email_addr NOT NULL UNIQUE,
timestamp timestamp default current_timestamp
);
And here's some test data:
insert into email_tbl (email_addr) values('me@home.net');
insert into email_tbl (email_addr, timestamp)
values('me2@home.net', '2015-07-15 00:00:00'::timestamp);
select * from email_tbl where timestamp < now() - '1 day'::interval
All the best

SSIS - How to Identify which package a row in the log table is referring to?

I have multiple SSIS integration packages logging to a database. They all write to the table sysssislog.
I want a stored procedure to be able to return the success of the last run of a selected package.
How do I identify a package in sysssislog? The executionid field would seem to work, but it seems like it changes values on most runs of the same package (sometimes it stays the same). Is there some way to know which package a log entry is coming from?
Structure of sysssislog for reference:
CREATE TABLE [dbo].[sysssislog](
[id] [int] IDENTITY(1,1) NOT NULL,
[event] [sysname] NOT NULL,
[computer] [nvarchar](128) NOT NULL,
[operator] [nvarchar](128) NOT NULL,
[source] [nvarchar](1024) NOT NULL,
[sourceid] [uniqueidentifier] NOT NULL,
[executionid] [uniqueidentifier] NOT NULL,
[starttime] [datetime] NOT NULL,
[endtime] [datetime] NOT NULL,
[datacode] [int] NOT NULL,
[databytes] [image] NULL,
[message] [nvarchar](2048) NOT NULL)
Like the original poster, I wanted to see the name of my package in front of all of my source names when going through my SSIS log. In reading William's response, I realized the ExecutionID could be leveraged to do exactly that, at least when using the SSIS log provider for SQL Server.
If you're using SQL Server 2008 and your SSIS logging table uses the standard name of "sysssislog", then try this query:
SELECT s1.id, s1.operator, s1.event, s2.source package_name, s1.source,
CONVERT(varchar, s1.starttime, 120) AS starttime, s1.message, s1.datacode
FROM dbo.sysssislog AS s1 LEFT OUTER JOIN
dbo.sysssislog AS s2 ON s1.executionid = s2.executionid
WHERE s2.event = 'PackageStart'
ORDER BY s1.id
Notes:
The convert statement trims off the fractional seconds.
Adjust the table name if you're using SQL Server 2005.
I used the query to create a view I called "SSIS_Log_View"
I hope that helps.
Here's a nice view candidate for looking at the execution history of all the packages in your SSIS instance; you can also see how long a package was running, in minutes:
select
source [PackageName], s.executionid
, min(s.starttime) StartTime
, max(s.endtime) EndTime
, (cast(max(s.endtime) as float) - cast(min(s.starttime) as float))*24*60 DurationInMinutes
from dbo.sysssislog as s
where event in ('PackageStart', 'PackageEnd')
--and source = 'foobar'
group by s.source, s.executionid
order by max(s.endtime) desc
Take a look and see if this helps you; from Books Online:
source nvarchar
The name of the executable, in the package, that generated the logging entry.
sourceid uniqueidentifier
The GUID of the executable in the package that generated the logging entry.
The column "sourceid" would be the same as your SSIS package GUID for the events
PackageStart
PackageEnd
As said above, the executionid is the GUID of the particular instance of the run.
You may want to enable the "OnError" event handler in order to identify the packages that failed.
To generate the report, what you can do is:
join msdb.[dbo].[sysdtspackages] and the dbo.sysssislog table on id = sourceid. The packages that failed will have an OnError entry in the sysssislog table, from which you can infer the status. A sketch of that join is shown below.
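A minimal sketch of that join, assuming your packages are stored in the legacy msdb.dbo.sysdtspackages table (on SQL Server 2005/2008 the equivalent store is sysdtspackages90/sysssispackages, so adjust the table name to wherever your packages live):
SELECT p.name AS package_name,
       l.executionid,
       MIN(l.starttime) AS started_at,
       MAX(CASE WHEN l.event = 'OnError' THEN 1 ELSE 0 END) AS had_error
FROM msdb.dbo.sysdtspackages AS p
JOIN dbo.sysssislog AS l
  ON l.sourceid = p.id
GROUP BY p.name, l.executionid
ORDER BY started_at DESC;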
--
Please mark if this answers your question
The Source ID column, where the event is "PackageStart", identifies the package name. The Execution ID ties together all of the related rows for that instance of your package run.
The Source ID can be tied back to your package by opening it and looking at the ID field in the package-level properties. This GUID matches the package-level Source ID column in the log. Each object in your package also has its own GUID, and these can be seen in the log.
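For example, a minimal sketch (replace the GUID with the ID shown in your package-level properties):
SELECT id, event, source, sourceid, executionid, starttime
FROM dbo.sysssislog
WHERE sourceid = '00000000-0000-0000-0000-000000000000' -- your package-level GUID
  AND event = 'PackageStart'
ORDER BY starttime DESC;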
In case you want to monitor the execution while the project is running, you will have to use a more sophisticated query, like:
SELECT MIN(A.ID) AS ID,
A.Source
, MIN(A.StartTime) AS StartTime
, case when MAX(B.endtime) IS NULL then null else round((cast(MAX(B.endtime) as float) - cast(MIN(A.starttime) as float))*24*60*60, 0) end AS Seconds
, C.Message
FROM (
SELECT ID, Source, StartTime, ExecutionID
FROM SysSSiSLog
WHERE Event = 'PackageStart'
) A
LEFT OUTER JOIN (
SELECT Source, EndTime, ExecutionID
FROM SysSSiSLog
WHERE Event = 'PackageEnd'
) B
ON A.Source = B.Source AND A.ExecutionID = B.ExecutionID
LEFT OUTER JOIN (
SELECT distinct Source, EndTime, ExecutionID, Message
FROM SysSSiSLog
WHERE Event = 'OnError'
) C
ON A.Source = C.Source AND A.ExecutionID = C.ExecutionID
WHERE A.ID >= (SELECT MAX(ID) FROM SysSSiSLog WHERE Source = 'Main' AND Event = 'PackageStart')
GROUP BY A.Source, A.ExecutionID, C.Message
ORDER BY min(A.ID) Asc
