What is the syntax to add a column to a table using Flink SQL?

Given below is the CREATE statement for the table I created using Flink.
CREATE TABLE event_kafkaTable (
columnA string,
columnB string,
timeofevent string,
eventTime AS TO_TIMESTAMP(TimestampConverterUtil(timeofevent)),
WATERMARK FOR eventTime AS eventTime - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'event_name',
'properties.bootstrap.servers'='127.0.0.1:9092',
'properties.group.id' = 'action_hitGroup',
'format'= 'json',
'scan.startup.mode'='earliest-offset',
'json.fail-on-missing-field'='false',
'json.ignore-parse-errors'='true'
)
The table above listens to Kafka and stores data from the topic named event_name. Now I want to ALTER this table by adding a new column. These are the ALTER commands I tried running from my Flink job:
1. ALTER TABLE event_kafkaTable ADD COLUMN test6 string;
2. ALTER TABLE event_kafkaTable ADD test6 string;
Both of these commands threw a Flink SQL parser exception.
Flink's official documentation, https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/sql/alter.html, does not list the syntax to add or drop a column on a table. Can you please let me know the syntax to add or drop a column on a table using Flink's Table API?

This is not supported yet in the (default) SQL DDL syntax, but you can use the AddColumns and DropColumns Table API methods to perform those operations.
This documentation page has examples on how to use them for each supported language.
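For example, here is a rough Java Table API sketch against the Flink 1.11 API (the derived expression for test6 is made up purely for illustration):
import static org.apache.flink.table.api.Expressions.$;
import static org.apache.flink.table.api.Expressions.concat;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

// Assumes the CREATE TABLE DDL above has already been executed on this
// TableEnvironment (e.g. via tEnv.executeSql(...)).
TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());
Table events = tEnv.from("event_kafkaTable");

// addColumns appends a computed field (it fails if the name already exists).
Table withTest6 = events.addColumns(concat($("columnA"), "_flag").as("test6"));

// dropColumns removes fields from the derived table.
Table withoutColumnB = withTest6.dropColumns($("columnB"));
Note that these methods derive new Table objects inside a query; they do not modify the catalog definition of event_kafkaTable, so the CREATE TABLE above stays as it is.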

Related

Problems importing timestamp from Parquet files

I'm exporting data into Parquet files and importing them into Snowflake. The export is done with Python (using to_parquet from pandas) on a Windows Server machine.
The exported file has several timestamp columns. Here's the metadata of one of these columns (ParquetViewer):
I'm having weird issues trying to import the timestamp columns into Snowflake.
Attempt 1 (using the copy into):
create or replace table STAGING.DIM_EMPLOYEE(
"EmployeeID" NUMBER(38,0),
"ExitDate" TIMESTAMP_NTZ(9)
);
copy into STAGING.DIM_EMPLOYEE
from @S3
pattern='dim_Employee_.*.parquet'
file_format = (type = parquet)
match_by_column_name = case_insensitive;
select * from STAGING.DIM_EMPLOYEE;
The timestamp column is not imported correctly:
It seems that Snowflake assumes that the value in the column is in seconds and not in microseconds and therefore converts incorrectly.
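To make the scale mix-up concrete, here is a standalone sketch with a made-up epoch value (not taken from the actual file):
import java.time.Instant;

public class EpochScaleDemo {
    public static void main(String[] args) {
        // 2021-01-01T00:00:00Z expressed in microseconds since the Unix epoch.
        long micros = 1_609_459_200_000_000L;

        // Correct interpretation: scale microseconds down to seconds first.
        System.out.println(Instant.ofEpochSecond(micros / 1_000_000)); // 2021-01-01T00:00:00Z

        // Misreading the same value as seconds lands roughly 51 million years in the future.
        System.out.println(Instant.ofEpochSecond(micros));
    }
}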
Attempt 2 (using the external tables):
Then I created an external table:
create or replace external table STAGING.EXT_DIM_EMPLOYEE(
"EmployeeID" NUMBER(38,0) AS (CAST(GET($1, 'EmployeeID') AS NUMBER(38,0))),
"ExitDate" TIMESTAMP_NTZ(9) AS (CAST(GET($1, 'ExitDate') AS TIMESTAMP_NTZ(9)))
)
location=@S3
pattern='dim_Employee_.*.parquet'
file_format='parquet'
;
SELECT * FROM STAGING.EXT_DIM_EMPLOYEE;
The data is still incorrect - still the same issue (seconds instead of microseconds):
Attempt 3 (using the external tables, with modified TO_TIMESTAMP):
I've then modified the external table definition to specify explicitly that microseconds are used, calling TO_TIMESTAMP_NTZ with scale parameter 6:
create or replace external table STAGING.EXT_DIM_EMPLOYEE_V2(
"EmployeeID" NUMBER(38,0) AS (CAST(GET($1, 'EmployeeID') AS NUMBER(38,0))),
"ExitDate" TIMESTAMP_NTZ(9) AS (TO_TIMESTAMP_NTZ(TO_NUMBER(GET($1, 'ExitDate')), 6))
)
location=@CHICOREE_D365_BI_STAGE/
pattern='dim_Employee_.*.parquet'
file_format='parquet'
;
SELECT * FROM STAGING.EXT_DIM_EMPLOYEE_V2;
Now the data is correct:
But now the "weird" issue appears:
I can load the data into a table, but the load is quite slow and I get a "Querying (repair)" message during the load. However, at the end, the query is executed, albeit slowly:
I want to load the data from a stored procedure using a SQL script. When executing the statement using EXECUTE IMMEDIATE, an error is returned:
DECLARE
SQL STRING;
BEGIN
SET SQL := 'INSERT INTO STAGING.DIM_EMPLOYEE ("EmployeeID", "ExitDate") SELECT "EmployeeID", "ExitDate" FROM STAGING.EXT_DIM_EMPLOYEE_V2;';
EXECUTE IMMEDIATE :SQL;
END;
I have also tried defining the timestamp column in the external table as a NUMBER, importing it and converting it into a timestamp later. This runs into the same issue (returning "SQL execution internal error" in the SQL script).
Has anyone experienced an issue like this? It seems like a bug to me.
Basically, my goal is to generate insert/select statements dynamically and execute them (in stored procedures). I have a lot of files (with different schemas) that need to be imported, and I want to create a "universal logic" to load these Parquet files into Snowflake.
As confirmed in the Snowflake Support ticket you opened, this issue got resolved when the Snowflake Support team enabled an internal configuration for Parquet timestamp logical types.
If anyone encounters a similar issue please submit a Snowflake Support ticket.

Cassandra DB Inserting data into multiple tables

I have been reading the Cassandra documentation on DataStax as well as the Apache docs. So far I have learned that we cannot create more than one index (one primary, one secondary index) on a table, and that there should be an individual table for each query.
Comparing this to an SQL table, for example one on which we want to query 4 fields, in the case of Cassandra we should split this table into 4 tables, right? (Please correct me if I am wrong.)
On these 4 tables I can have the indexes and make the queries.
My question is: how can we insert data into these 4 tables? Should I make 4 consecutive insert requests?
My priority is to avoid secondary indexes.
To keep data in sync across denormalised tables, you need to use CQL BATCH statements.
For example, if you had these tables to maintain:
movies
movies_by_actor
movies_by_genre
then you would group the updates in a CQL BATCH like this:
BEGIN BATCH
INSERT INTO movies (...) VALUES (...);
INSERT INTO movies_by_actor (...) VALUES (...);
INSERT INTO movies_by_genre (...) VALUES (...);
APPLY BATCH;
Note that it is also possible to do UPDATE and DELETE statements as well as conditional writes in a batch.
The example above just illustrates the syntax in cqlsh; in a real application you would typically do this with the driver. Here is an example BatchStatement using the Java driver:
SimpleStatement insertMovies =
SimpleStatement.newInstance(
"INSERT INTO movies (...) VALUES (?, ...)", <some_values>);
SimpleStatement insertMoviesByActor =
SimpleStatement.newInstance(
"INSERT INTO movies_by_actor (...) VALUES (?, ...)", <some_values>);
SimpleStatement insertMoviesByGenre =
SimpleStatement.newInstance(
"INSERT INTO movies_by_genre (...) VALUES (?, ...)", <some_values>);
BatchStatement batch =
BatchStatement.builder(DefaultBatchType.LOGGED)
.addStatement(insertMovies)
.addStatement(insertMoviesByActor)
.addStatement(insertMoviesByGenre)
.build();
For details, see Java driver Batch statements. Cheers!
Cassandra supports secondary indexes and SSTable-Attached Secondary Indexes (SASI). Storage-Attached Indexes (SAI) have been donated to the project but not yet accepted.
You need to create your tables such that you can get all the required data from a single table with a single query, which looks something like this:
SELECT * from keyspace.table_name where key = 'ABC';
So what does this mean for you as a designer? You need to identify all your queries first, and on the basis of those queries you define your data model (tables). So if you think you will need 4 tables to satisfy your queries, then you are right.
Since all 4 of the tables you define will have to stay in sync if they represent the same data, the best way is to use a batch:
BEGIN BATCH
DML_statement1 ;
DML_statement2 ;
DML_statement3 ;
DML_statement4 ;
APPLY BATCH ;
A batch does not guarantee that all statements will either succeed or be rolled back. It informs the client that the group of statements has failed, so the client should retry applying them.
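As a rough sketch of that retry, assuming a connected CqlSession named session and the BatchStatement batch built in the Java driver example above (the attempt count and backoff are made-up values):
// Retry the logged batch a few times if the coordinator reports a failure.
int maxAttempts = 3;
for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
        session.execute(batch);
        break; // all statements were applied
    } catch (com.datastax.oss.driver.api.core.DriverException e) {
        if (attempt == maxAttempts) {
            throw e; // give up and surface the error to the caller
        }
        // The batch consists of plain inserts, so replaying it is safe; back off briefly first.
        try {
            Thread.sleep(1000L * attempt);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw e;
        }
    }
}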
It is better to avoid secondary indexes if you can because of the performance issues that come with them. A general rule of thumb is to only index a column with low cardinality (few distinct values).

Schema and call flow in VoltDB

What is the format of the schema when we create a new table using VoltDB?
I'm a newbie. I have researched for a while and read the explanation at https://docs.voltdb.com/UsingVoltDB/ChapDesignSchema.php.
Please give me more detail about the schema format when I create a new table.
Another question is: what is the call flow through the system, from when a request comes in until a response is created?
Which classes/functions does it go through in the system?
Since VoltDB is a SQL-compliant database, you create a new table in VoltDB just as you would create a new table in any other traditional relational database. For example:
CREATE TABLE MY_TABLE (id INTEGER NOT NULL, name VARCHAR(10));
You can find all the SQL DDL statements that you can run on VoltDB here
1. Make a file yourSchemaName.sql anywhere on the system. Suppose yourSchemaName.sql looks like this:
CREATE TABLE Customer (
CustomerID INTEGER UNIQUE NOT NULL,
FirstName VARCHAR(15),
LastName VARCHAR (15),
PRIMARY KEY(CustomerID)
);
2. Fire up sqlcmd from the CLI inside the folder where you have installed VoltDB; if you haven't set the path, you have to type ./bin/sqlcmd.
Once sqlcmd is running, a simple way to load the schema into your VoltDB database is to type the file /path/to/yourSchemaName.sql; command inside the sqlcmd utility, and the schema in yourSchemaName.sql will be imported into the database.
VoltDB is a relational database, so you can now use all the usual SQL queries against it.
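To sketch how a request then flows in from a client once the schema is loaded, here is a hypothetical snippet using VoltDB's Java client (the names and values are assumptions; VoltDB generates a default CUSTOMER.insert procedure for the table above, and ad-hoc SQL goes through the @AdHoc system procedure):
import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;

public class CustomerLoader {
    public static void main(String[] args) throws Exception {
        // Connect to a locally running VoltDB node.
        Client client = ClientFactory.createClient();
        client.createConnection("localhost");

        // Insert a row through the auto-generated default procedure.
        client.callProcedure("CUSTOMER.insert", 1, "Jane", "Doe");

        // Run ad-hoc SQL through the @AdHoc system procedure.
        ClientResponse query = client.callProcedure("@AdHoc",
                "SELECT * FROM Customer WHERE CustomerID = 1;");
        System.out.println(query.getResults()[0]);

        client.close();
    }
}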

Varchar2 in Oracle update query from SSIS project

I need to make an update in an Oracle database using SSIS. I am using the custom database task with a query like:
UPDATE Table SET column1 = ? where KEY = ?
The key is taken from a SQL Server table and is of type nvarchar(3); the key in the Oracle database is of type varchar2(3). At first it was complaining that the key is 4 characters long, so I changed the query to:
UPDATE Table SET column1 = ? where KEY = TRIM(CAST(? AS VARCHAR(3)))
It works for keys which have 3 characters, but there are also keys that are 2 characters long. I've tried trimming and converting them, but I cannot make it work for 2-character keys.
The Oracle character set is AL32UTF8 for CHAR and AL16UTF16 for NCHAR.
I resolved the problem by creating a derived column KEY_LEN as LEN(key) before the update step. Then I used it in the update as the third parameter:
UPDATE Table SET column1 = ? where KEY = SUBSTR(?,0,?)

Data migration from MySQL to HSQL

I was working on migrating data from MySQL to HSQLDB.
In the MySQL data file there are plenty of records where date values are set to '0000-00-00', and the HSQL database throws the error below for all such records:
"data exception: invalid datetime format / Error Code: -3407 / State: 22007"
I would like to know what the optimal solution for this problem could be.
Thanks in advance
HSQLDB follows the SQL Standard and allows valid dates only. A date such as '0001-01-01' would be a good candidate for the default value.
Regardless of the method used for data inserts, the '0000-00-00' strings should be corrected before insert. One way of doing this is to use a default value for the target column with DEFAULT DATE'0001-01-01' and replace the string in the INSERT statement with the keyword DEFAULT. For example:
CREATE TABLE MYTABLE ( C1 INT, C2 DATE DEFAULT DATE'0001-01-01')
INSERT INTO MYTABLE VALUES 1, DEFAULT
INSERT INTO MYTABLE VALUES 3, '2010-08-14'
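For a programmatic migration, here is a rough JDBC sketch of the same correction (the in-memory HSQLDB URL, table name and sample rows are placeholders for illustration):
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.util.List;

public class ZeroDateFixer {
    // Replacement for MySQL's '0000-00-00' before the row reaches HSQLDB.
    private static final Date DEFAULT_DATE = Date.valueOf("0001-01-01");

    public static void main(String[] args) throws Exception {
        try (Connection hsql = DriverManager.getConnection("jdbc:hsqldb:mem:migration", "SA", "")) {
            try (Statement ddl = hsql.createStatement()) {
                ddl.execute("CREATE TABLE MYTABLE (C1 INT, C2 DATE DEFAULT DATE'0001-01-01')");
            }

            // Rows as exported from MySQL, including an invalid zero date.
            List<String[]> rows = List.of(
                    new String[] {"1", "0000-00-00"},
                    new String[] {"3", "2010-08-14"});

            try (PreparedStatement insert =
                         hsql.prepareStatement("INSERT INTO MYTABLE (C1, C2) VALUES (?, ?)")) {
                for (String[] row : rows) {
                    insert.setInt(1, Integer.parseInt(row[0]));
                    // Correct the '0000-00-00' string before insert, as suggested above.
                    insert.setDate(2, "0000-00-00".equals(row[1]) ? DEFAULT_DATE : Date.valueOf(row[1]));
                    insert.executeUpdate();
                }
            }
        }
    }
}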
