I am trying to use Apache Hudi with Flink SQL by following Hudi's Flink guide.
The basics are working, but now I need to provide a custom implementation of HoodieRecordPayload, as suggested in this FAQ.
But when I pass this config as shown in the following listing, it doesn't work: my custom class (MyHudiPoc.Poc) doesn't get picked up, and the default behaviour continues.
CREATE TABLE t1(
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = '/tmp/hudi',
  'hoodie.compaction.payload.class' = 'MyHudiPoc.Poc', -- My custom class
  'hoodie.datasource.write.payload.class' = 'MyHudiPoc.Poc', -- My custom class
  'write.payload.class' = 'MyHudiPoc.Poc', -- My custom class
  'table.type' = 'MERGE_ON_READ'
);
INSERT INTO t1 VALUES
('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
insert into t1 values
('id1','Danny1',27,TIMESTAMP '1970-01-01 00:00:01','par1');
I even tried passing it through /etc/hudi/conf/hudi-default.conf:
"hoodie.compaction.payload.class": MyHudiPoc.Poc
"hoodie.datasource.write.payload.class": MyHudiPoc.Poc
"write.payload.class": MyHudiPoc.Poc
I am also passing my custom jar while starting the Flink SQL client.
/bin/sql-client.sh embedded \
-j ../jars/hudi-flink1.15-bundle-0.12.1.jar \
-j ./plugins/flink-s3-fs-hadoop-1.15.1.jar \
-j ./plugins/parquet-hive-bundle-1.8.1.jar \
-j ./plugins/flink-sql-connector-kafka-1.15.1.jar \
-j my-hudi-poc-1.0-SNAPSHOT.jar \
shell
I am able to pass my custom class in the Spark example, but not in Flink.
I have tried with both COW and MOR table types.
Any idea what I am doing wrong?
Hi, I have declared a table like this:
create or replace table app_event (
  ID varchar(36) not null primary key,
  VERSION number,
  ACT_TYPE varchar(255),
  EVE_TYPE varchar(255),
  CLI_ID varchar(36),
  DETAILS variant,
  OBJ_TYPE varchar(255),
  DATE_TIME timestamp,
  AAPP_EVENT_TO_UTC_DT timestamp,
  GRO_ID varchar(36),
  OBJECT_NAME varchar(255),
  OBJ_ID varchar(255),
  USER_NAME varchar(255),
  USER_ID varchar(255),
  EVENT_ID varchar(255),
  FINDINGS varchar(255),
  SUMMARY variant
);
The DETAILS column will contain an XML file, so that I can run XML functions and extract elements of that XML file.
My sample row looks like this:
dfjkghdfkjghdf8gd7f7997,0,TEST_CASE,CHECK,74356476476DFD,<?xml version="1.0" encoding="UTF-8"?><testPayload><testId>3495864795uiyiu</testId><testCode>COMPLETED</testCode><testState>ONGOING</testState><noOfNewTest>1</noOfNewTest><noOfReviewRequiredTest>0</noOfReviewRequiredTest><noOfExcludedTest>0</noOfExcludedTest><noOfAutoResolvedTest>1</noOfAutoResolvedTest><testerTypes>WATCHLIST</testerTypes></testPayload>,CASE,41:31.3,NULL,948794853948dgjd,(null),dfjkghdfkjghdf8gd7f7997,test user,dfjkghdfkjghdf8gd7f7997,NULL,(null),(null)
When I declare DETAILS as varchar I am able to load the file, but when I declare it as variant I get the error below for that column only:
Error parsing JSON:
dfjkghdfkjghdf8gd7f7997COMPLETED</status
File 'SNOWFLAKE/Sudarshan.csv', line 1, character 89 Row 1, column
"AUDIT_EVENT"["DETAILS":6]
Can you please help with this?
I cannot use varchar, as I also need to query elements of the XML in my queries.
This is how I load into the table; I use the default CSV format, and the file is available in S3:
COPY INTO demo_db.public.app_event
FROM @my_s3_stage/
FILES = ('app_Even.csv')
file_format=(type='CSV');
Based on the answer below, this is how I am loading:
copy into demo_db.public.app_event from (
  select
    $1,$2,$3,$4,$5,
    parse_xml($6),$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,parse_xml($17)
  from @~/Audit_Even.csv d
)
file_format = (
  type = CSV
)
But when I execute it, it says zero rows processed, and the stage mentioned here does not exist in my account.
If you are using a COPY INTO statement, then you need to put in a subquery to convert the data before loading it into the table. Use parse_xml within your COPY statement's subquery, something like this:
copy into app_event from (
  select
    $1,
    parse_xml($2) -- <---- "$2" is the column number in the CSV that contains the xml
  from @~/test.csv.gz d -- <---- This is my own internal user stage. You'll need to change this to your external stage or whatever
)
file_format = (
  type = CSV
)
It is hard to provide you with a good SQL statement without a full example of your existing code (your COPY / INSERT statement). In my example above, I'm copying a file in my own user stage (@~/test.csv.gz) with the default CSV file format options. You are likely using an external stage, but it should be easy to adapt this to your own example.
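Once DETAILS is loaded as a VARIANT, the individual XML elements can be queried. A minimal sketch, assuming the app_event table and the tag names from the sample row (XMLGET and the :"$" content accessor are standard Snowflake XML functions):
-- Extract single elements from the XML stored in the DETAILS variant column.
-- XMLGET(variant, 'tag') returns the element; :"$" pulls out its content.
SELECT
  XMLGET(DETAILS, 'testId'):"$"::string AS test_id,
  XMLGET(DETAILS, 'testCode'):"$"::string AS test_code
FROM demo_db.public.app_event;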
I have read the linked question Table options do not contain an option key 'connector'.
It said we should set the format.
But my scenario is datagen -> Hive.
Here's my complete example (it's wrong now):
drop table if exists datagen;
CREATE TABLE datagen (
  f_sequence INT,
  f_random INT,
  f_random_str STRING,
  ts AS localtimestamp,
  WATERMARK FOR ts AS ts
) WITH (
  'connector' = 'datagen',
  -- optional options --
  'rows-per-second'='5',
  'fields.f_sequence.kind'='sequence',
  'fields.f_sequence.start'='1',
  'fields.f_sequence.end'='50', -- this limits the total number of rows produced
  'fields.f_random.min'='1',
  'fields.f_random.max'='50',
  'fields.f_random_str.length'='10'
);
SET table.sql-dialect=hive;
drop table if exists hive_table;
CREATE TABLE hive_table (
  f_sequence INT,
  f_random INT,
  f_random_str STRING
) PARTITIONED BY (dt STRING, hr STRING, mi STRING) STORED AS parquet TBLPROPERTIES (
  'partition.time-extractor.timestamp-pattern'='$dt $hr:$mi:00',
  'sink.partition-commit.trigger'='partition-time',
  'sink.partition-commit.delay'='1 min',
  'sink.partition-commit.policy.kind'='metastore,success-file'
);
Flink SQL> insert into hive_table select f_sequence,f_random,f_random_str ,DATE_FORMAT(ts, 'yyyy-MM-dd'), DATE_FORMAT(ts, 'HH') ,DATE_FORMAT(ts, 'mm') from datagen;
[INFO] Submitting SQL update statement to the cluster...
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.table.api.ValidationException: Table options do not contain an option key 'connector' for discovering a connector.
Is the solution from the above link suitable for this case?
Need your help, thanks!
Please use SET table.sql-dialect=default; before calling insert into hive_table .... The insert into hive_table ... statement reads from the datagen connector, which the Hive dialect does not support.
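In other words, the Hive dialect is only needed for the Hive DDL; switch back to the default dialect before the insert. A minimal sketch of the corrected sequence, reusing the statements from the question:
SET table.sql-dialect=hive;
-- CREATE TABLE hive_table (...) exactly as in the question

SET table.sql-dialect=default; -- switch back before querying the datagen source
INSERT INTO hive_table
SELECT f_sequence, f_random, f_random_str,
  DATE_FORMAT(ts, 'yyyy-MM-dd'), DATE_FORMAT(ts, 'HH'), DATE_FORMAT(ts, 'mm')
FROM datagen;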
I'm running some queries in PostgreSQL for a student project. Now I need to encrypt some fields using AES256, run the same queries, and compare the times. Any idea how this can be done using UPDATE table? For example, I need to encrypt the customer address in the address table. Can I do this using UPDATE? Can anyone give me an example? I couldn't find anything online. Thanks.
CREATE EXTENSION IF NOT EXISTS pgcrypto;
INSERT INTO table(name,age) VALUES(
PGP_SYM_ENCRYPT('John', 'AES_KEY'),
PGP_SYM_ENCRYPT('22', 'AES_KEY')
);
UPDATE table SET
(name,age) = (
PGP_SYM_ENCRYPT('Jona', 'AES_KEY'),
PGP_SYM_ENCRYPT('15','AES_KEY')
) WHERE id='1';
SELECT
  PGP_SYM_DECRYPT(name::bytea, 'AES_KEY') as name,
  PGP_SYM_DECRYPT(age::bytea, 'AES_KEY') as age
FROM table
WHERE LOWER(PGP_SYM_DECRYPT(name::bytea, 'AES_KEY'))
  LIKE LOWER('%John%');
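Note that PGP_SYM_ENCRYPT defaults to AES-128; for AES256 you can pass the cipher-algo option. And since the question asks about encrypting an existing column in place, here is a minimal sketch, assuming a hypothetical address table with a text address column:
-- Encrypt an existing plaintext column in place with AES-256.
-- PGP_SYM_ENCRYPT returns bytea; ::text stores its hex form in a text column,
-- which casts back with ::bytea when decrypting (as in the SELECT above).
UPDATE address
SET address = PGP_SYM_ENCRYPT(address, 'AES_KEY', 'cipher-algo=aes256')::text;

-- Reading it back:
SELECT PGP_SYM_DECRYPT(address::bytea, 'AES_KEY') AS address FROM address;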
I have a model I created on the fly for peewee. Something like this:
class TestTable(PeeweeBaseModel):
    whencreated_dt = DateTimeField(null=True)
    whenchanged = CharField(max_length=50, null=True)
I load data from a text file into a table using peewee. The column "whenchanged" contains dates in the '%Y-%m-%d %H:%M:%S' format, stored as a varchar column. Now I want to convert the text field "whenchanged" into a datetime in "whencreated_dt".
I tried several things... I ended up with this:
# Initialize table to TestTable
to_execute = "table.update({table.%s : datetime.strptime(table.%s, '%%Y-%%m-%%d %%H:%%M:%%S')}).execute()" % ('whencreated_dt', 'whencreated')
which fails with "TypeError: strptime() argument 1 must be str, not CharField". I'm trying to convert "whencreated" to a datetime and then assign it to "whencreated_dt".
I tried a variation... the following, for example, works without a hitch:
# Initialize table to TestTable
to_execute = "table.update({table.%s : datetime.now()}).execute()" % (self.name)
exec(to_execute)
But this is of course just the current datetime, not another field.
Does anyone know a solution to this?
Edit... I did find a workaround eventually, but I'm still looking for a better solution. The workaround:
all_objects = table.select()
for o in all_objects:
    datetime_str = getattr(o, 'whencreated')
    setattr(o, 'whencreated_dt', datetime.strptime(datetime_str, '%Y-%m-%d %H:%M:%S'))
    o.save()
Loop over all rows in the table, get "whencreated", convert it to a datetime, put it in "whencreated_dt", and save each row.
Regards,
Sven
Your example:
to_execute = "table.update({table.%s : datetime.strptime(table.%s, '%%Y-%%m-%%d %%H:%%M:%%S')}).execute()" % ('whencreated_dt', 'whencreated')
Will not work. Why? Because datetime.strptime is a Python function and operates in Python. An UPDATE query works in database-land. How the hell is the database going to magically pass row values into "datetime.strptime"? How would the db even know how to call such a function?
Instead you need to use a SQL function -- a function that is executed by the database. For example, in Postgres:
TestTable.update(whencreated_dt=TestTable.whenchanged.cast('timestamp')).execute()
This is the equivalent SQL:
UPDATE test_table SET whencreated_dt = CAST(whenchanged AS timestamp);
That should populate the column for you using the correct data type. For other databases, consult their manuals. Note that SQLite does not have a dedicated date/time data type, and the datetime functionality uses strings in the Y-m-d H:M:S format.
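For SQLite in particular, since the values are already Y-m-d H:M:S strings, the conversion can happen entirely in SQL. A hedged sketch, assuming the test_table / column names from above:
-- SQLite stores datetimes as text anyway, so the string can be copied
-- (or normalized via datetime()) without any Python-side parsing.
UPDATE test_table
SET whencreated_dt = datetime(whenchanged);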
I'm using BCP to export and import data, but it seems that the SQLNUMERIC and SQLDECIMAL data types are not supported. Exporting seems to be fine:
-- hit alt+Q then M to enable SQLCMD mode
use tempdb
go
create table mytest (a decimal);
insert mytest values (3.3);
-- export to c drive
!!bcp "tempdb..mytest" out "c:\mytest.dat" -T -n -S"YourServer\YourInstance"
!!bcp "tempdb..mytest" format nul -T -n -f "c:\mytest.fmt" -S"YourServer\YourInstance"
GO
That works okay, but when I then go to import the data back (like this):
SELECT a.*
FROM OPENROWSET(
BULK 'C:\mytest.dat',
FORMATFILE = 'C:\mytest.fmt') AS a
I get the error message:
Msg 4838, Level 16, State 1, Line 2
The bulk data source does not support the SQLNUMERIC or SQLDECIMAL data types.
Question: How can I import numeric data that was exported using BCP?
I have control over the bcp commands shown in this question, but not the table definitions. A T-SQL only solution is preferred.
Instead of using the "native" format, I tried with the character one ("-c") and it worked. The modified script which I used was:
use tempdb
go
create table mytest (id int, t varchar(12), a decimal(18,2), c char);
insert mytest values (1, 'test1', 3.6, 'a');
insert mytest values (2, 'test3', 3.3, 'b');
go
!!bcp "tempdb..mytest" out "d:\temp\mytest.dat" -T -c -S"localhost"
!!bcp "tempdb..mytest" format nul -T -c -f "d:\temp\mytest.fmt" -S"localhost"
GO
select * from mytest
SELECT a.*
FROM OPENROWSET(
BULK 'd:\temp\mytest.dat',
FORMATFILE = 'd:\temp\mytest.fmt') AS a
I am not sure if it's feasible in your case, but you can give it a try.
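If the goal is to land the rows back in a table rather than just select them, the same OPENROWSET source can feed an INSERT. A minimal sketch, assuming the mytest definition above:
-- Re-import the character-format export back into the table.
INSERT INTO tempdb..mytest (id, t, a, c)
SELECT a.id, a.t, a.a, a.c
FROM OPENROWSET(
  BULK 'd:\temp\mytest.dat',
  FORMATFILE = 'd:\temp\mytest.fmt') AS a;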