I am working on a Flink streaming job where I need to upsert data into a Hudi table. I am using a MERGE INTO query for the upsert:
Table table = tableEnv.fromDataStream(KafkaStreamTableDataStreamStream);
tableEnv.createTemporaryView("table1", table);
tableEnv.executeSql("Merge into target " +
"USING table1 s0 " +
"ON target.id = s0.id " +
"WHEN MATCHED THEN UPDATE SET amount=s0.amount");
This query works fine in spark-shell, but in Flink it gives me:
Exception in thread "main" org.apache.flink.table.api.TableException: Unsupported query: Merge into ..
Does the MERGE INTO statement work in a Flink job?
Flink doesn't support MERGE statements. This has been brought up for discussion but nothing has happened since then.
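That said, if the sink is a Hudi table you may not need MERGE at all: with the Hudi Flink connector, declaring a primary key makes a plain INSERT INTO behave as an upsert (upsert is the connector's default write operation). A minimal sketch, assuming table1 exposes id and amount columns, with placeholder types and path; the exact connector options depend on your Hudi version:
// hypothetical Hudi sink table; the declared primary key is what Hudi upserts on
tableEnv.executeSql(
    "CREATE TABLE target (" +
    "  id STRING," +
    "  amount DOUBLE," +
    "  PRIMARY KEY (id) NOT ENFORCED" +
    ") WITH (" +
    "  'connector' = 'hudi'," +
    "  'path' = 'file:///tmp/hudi/target'," +   // placeholder path
    "  'table.type' = 'MERGE_ON_READ'" +
    ")");

// rows whose id already exists overwrite the stored record instead of duplicating it
tableEnv.executeSql("INSERT INTO target SELECT id, amount FROM table1");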
Is it possible to have more than one INSERT INTO ... SELECT ... statement within a single PyFlink job (on Flink 1.13.6)?
I have a number of output tables that I create and I am trying to write to these within a single job, where the example Python & SQL looks like this (assume there is an input table called 'input'):
sql1 = "INSERT INTO out1 (col1, col2) SELECT col1, col2 FROM input"
sql2 = "INSERT INTO out2 (col3, col4) SELECT col3, col4 FROM input"
env.execute_sql(sql1)
env.execute_sql(sql2)
When this is run in a Flink cluster on Kinesis on AWS, I get a failure:
Cannot have more than one execute() or executeAsync() call in a single
environment.
When I look at the Flink web UI, I can see that there is one job called insert-into_default_catalog.default_database.out1. Does Flink separate out each INSERT statement into a separate job? It looks like it tries to create one job for the first query and then fails to create a second job for the second query.
Is there any way of getting it to run as a single job using SQL, without having to move away from SQL and the Table API?
If you want to do multiple INSERTs, you need to wrap them in a statement set:
stmt_set = table_env.create_statement_set()
# only single INSERT query can be accepted by `add_insert_sql` method
stmt_set.add_insert_sql(sql1)
stmt_set.add_insert_sql(sql2)
# execute all statements together
table_result = stmt_set.execute()
# get job status through TableResult
print(table_result.get_job_client().get_job_status())
See the docs for more info.
I have been reading the Cassandra documentation on DataStax as well as the Apache docs. So far I have learned that we cannot create more than one index (one primary, one secondary index) on a table, and that there should be an individual table for each query.
Comparing this to an SQL table, for example one on which we want to query 4 fields: in Cassandra we should split this into 4 tables, right? (Please correct me if I am wrong.) On these 4 tables I can have the indexes and run the queries.
My question is: how can we insert data into these 4 tables? Should I make 4 consecutive insert requests?
My priority is to avoid secondary indexes.
To keep data in sync across denormalised tables, you need to use CQL BATCH statements.
For example, if you had these tables to maintain:
movies
movies_by_actor
movies_by_genre
then you would group the updates in a CQL BATCH like this:
BEGIN BATCH
INSERT INTO movies (...) VALUES (...);
INSERT INTO movies_by_actor (...) VALUES (...);
INSERT INTO movies_by_genre (...) VALUES (...);
APPLY BATCH;
Note that it is also possible to do UPDATE and DELETE statements as well as conditional writes in a batch (there is a short sketch of that after the Java driver example below).
The example above just illustrates the syntax in cqlsh; in practice you would issue the batch from your application. Here is an example BatchStatement using the Java driver:
SimpleStatement insertMovies =
    SimpleStatement.newInstance(
        "INSERT INTO movies (...) VALUES (?, ...)", <some_values>);
SimpleStatement insertMoviesByActor =
    SimpleStatement.newInstance(
        "INSERT INTO movies_by_actor (...) VALUES (?, ...)", <some_values>);
SimpleStatement insertMoviesByGenre =
    SimpleStatement.newInstance(
        "INSERT INTO movies_by_genre (...) VALUES (?, ...)", <some_values>);

BatchStatement batch =
    BatchStatement.builder(DefaultBatchType.LOGGED)
        .addStatement(insertMovies)
        .addStatement(insertMoviesByActor)
        .addStatement(insertMoviesByGenre)
        .build();
For details, see Java driver Batch statements. Cheers!
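To also illustrate the note above about mixing statement types, here is a minimal sketch (hypothetical column names and bind variables, and assuming a CqlSession named session) that puts an UPDATE and a DELETE in the same logged batch:
SimpleStatement updateMovie =
    SimpleStatement.newInstance(
        "UPDATE movies SET title = ? WHERE movie_id = ?", newTitle, movieId);
SimpleStatement deleteMovieByActor =
    SimpleStatement.newInstance(
        "DELETE FROM movies_by_actor WHERE actor = ? AND movie_id = ?", actor, movieId);

BatchStatement mixedBatch =
    BatchStatement.builder(DefaultBatchType.LOGGED)
        .addStatement(updateMovie)
        .addStatement(deleteMovieByActor)
        .build();

// apply both changes as one batch
session.execute(mixedBatch);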
Cassandra supports secondary indexes and SSTable Attached Secondary Indexes (SASI). Storage Attached Indexes (SAI) have been donated to the project but not yet accepted.
You need to create your tables such that you can get all your required data from a table using a single query, which looks something like this:
SELECT * from keyspace.table_name where key = 'ABC';
So what does that mean for you as a designer? You need to identify all your queries up front and define your data model (tables) on the basis of those queries. So if you think you will need 4 tables to satisfy your queries, then you are right.
Since all 4 tables will have to stay in sync if they represent the same data, the best way is to use a batch:
BEGIN BATCH
DML_statement1 ;
DML_statement2 ;
DML_statement3 ;
DML_statement4 ;
APPLY BATCH ;
A batch does not guarantee that all statements will either succeed or be rolled back. It informs the client that the group of statements has failed, so the client should retry applying them.
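For example, with the Java driver that retry could look like the minimal sketch below (batch being a BatchStatement built as in the previous answer, executed on a CqlSession named session):
// naive retry: if the batch fails, re-apply it once; assuming the statements are
// plain (non-counter, non-conditional) inserts/updates, re-executing them is safe
try {
    session.execute(batch);
} catch (DriverException e) {
    session.execute(batch);
}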
It is better to avoid secondary indexes if you can because of the performance issues that come with them. A general rule of thumb is to only index a column with low cardinality (few distinct values).
Given below is the CREATE statement for the table I created using Flink:
CREATE TABLE event_kafkaTable (
columnA string,
columnB string,
timeofevent string,
eventTime AS TO_TIMESTAMP(TimestampConverterUtil(timeofevent)),
WATERMARK FOR eventTime AS eventTime - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'event_name',
'properties.bootstrap.servers'='127.0.0.1:9092',
'properties.group.id' = 'action_hitGroup',
'format'= 'json',
'scan.startup.mode'='earliest-offset',
'json.fail-on-missing-field'='false',
'json.ignore-parse-errors'='true'
)
The table above listens to the Kafka topic named event_name and stores its data. Now I want to ALTER this table by adding a new column. The following are the ALTER commands I tried running from my Flink job:
1. ALTER TABLE event_kafkaTable ADD COLUMN test6 string;
2. ALTER TABLE event_kafkaTable ADD test6 string;
Both of these commands threw a Flink SQL parser exception.
Flink's official documentation, https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/alter.html, does not list the syntax to add or drop a column from a table. Can you please let me know the syntax to add or drop a column on a table using Flink's Table API?
This is not supported yet in the (default) SQL DDL syntax, but you can use the AddColumns and DropColumns Table API methods to perform those operations.
This documentation page has examples on how to use them for each supported language.
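As a rough Java sketch of what that can look like for the Kafka table above (the new column is just filled with a literal placeholder here, and note that this derives a new Table object rather than altering the Kafka-backed catalog table):
import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.lit;

Table events = tableEnv.from("event_kafkaTable");

// add a column named test6, populated with a constant placeholder value
Table withTest6 = events.addColumns(lit("placeholder").as("test6"));

// drop a column that is no longer needed
Table trimmed = withTest6.dropColumns($("columnB"));

// register the derived table so later SQL queries can use it
tableEnv.createTemporaryView("event_kafkaTable_extended", trimmed);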
I am trying to use the QueryDatabase processor in Apache NiFi.
Is there any way I can limit the records, something like "select * from table limit 100"?
Is there any other processor in NiFi which supports this operation?
Use the ExecuteSQL processor for this case.
Configure/enable the DBCP connection pool, and in the SQL select query property set your select query:
select * from table limit 100
The processor then runs the configured SQL select query and outputs the results as a flowfile in Avro format.
I'm using the Flink SQL API and I have a query like:
Table result2 = tableEnv.sqlQuery(
"SELECT user, SUM(amount) " +
"FROM Orders " +
"GROUP BY TUMBLE(proctime, INTERVAL '1' DAY), user"
);
Can I enable "allowedLateness" and get late data as a side output?
Late data handling is not supported in Flink SQL yet (version 1.5.0). Late rows are just dropped.
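If you can implement that part of the pipeline with the DataStream API instead, allowedLateness and side outputs are available there (event time only, so not with proctime). A minimal sketch, assuming the orders arrive as (user, amount, eventTimeMillis) tuples rather than through the Orders table:
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateOrdersSideOutput {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // hypothetical source of (user, amount, eventTimeMillis) records
        DataStream<Tuple3<String, Double, Long>> orders = env
            .fromElements(
                Tuple3.of("alice", 10.0, 1_000L),
                Tuple3.of("bob", 5.0, 2_000L))
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple3<String, Double, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((order, ts) -> order.f2));

        // late records are routed here instead of being silently dropped
        final OutputTag<Tuple3<String, Double, Long>> lateTag =
            new OutputTag<Tuple3<String, Double, Long>>("late-orders") {};

        SingleOutputStreamOperator<Tuple3<String, Double, Long>> sums = orders
            .keyBy(order -> order.f0)
            .window(TumblingEventTimeWindows.of(Time.days(1)))
            .allowedLateness(Time.hours(1))      // keep window state open one extra hour
            .sideOutputLateData(lateTag)         // anything later than that goes to the side output
            .reduce((a, b) -> Tuple3.of(a.f0, a.f1 + b.f1, Math.max(a.f2, b.f2)));

        sums.print("sums");
        sums.getSideOutput(lateTag).print("late");

        env.execute("late-data-sketch");
    }
}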