How to read multiple directories with the Table API in PyFlink?

I want to read multiple directories with the Table API in PyFlink:
from pyflink.table import StreamTableEnvironment
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode
if __name__ == '__main__':
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_runtime_mode(RuntimeExecutionMode.BATCH)
    env.set_parallelism(1)
    table_env = StreamTableEnvironment.create(stream_execution_environment=env)
    table_env \
        .get_config() \
        .get_configuration() \
        .set_string("default.parallelism", "1")
    ddl = """
        CREATE TABLE test (
            a INT,
            b STRING
        ) WITH (
            'connector' = 'filesystem',
            'path' = '{path}',
            'format' = 'csv',
            'csv.ignore-first-line' = 'true',
            'csv.ignore-parse-errors' = 'true',
            'csv.array-element-delimiter' = ';'
        )
    """.format(path='/opt/data/day=2021-11-14,/opt/data/day=2021-11-15,/opt/data/day=2021-11-16')
    table_env.execute_sql(ddl)
But it failed with the following error:
Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: File /opt/data/day=2021-11-14,/opt/data/day=2021-11-15,/opt/data/day=2021-11-16 does not exist or the user running Flink ('root') has insufficient permissions to access it.
I'm sure these three directories exist and I have permission to access them:
/opt/data/day=2021-11-14
/opt/data/day=2021-11-15
/opt/data/day=2021-11-16
If it isn't possible to read multiple directories, I have to create three tables and union them, which is much more verbose.
Any suggestion is appreciated. Thank you.

Just using
'path' = '/opt/data'
should be sufficient. The filesystem connector is also able to read the partition field and perform filtering based on it. For example, you can define the table with this schema:
CREATE TABLE test (
    a INT,
    b STRING,
    `day` DATE
) PARTITIONED BY (`day`) WITH (
    'connector' = 'filesystem',
    'path' = '/opt/data',
    [...]
)
And then the following query:
SELECT * FROM test WHERE `day` = '2021-11-14'
will read only the files under /opt/data/day=2021-11-14.
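For the PyFlink job in the question, a minimal sketch of this partitioned setup could look as follows. The table name test_partitioned is made up for illustration, the partition column is declared as STRING to keep the comparison simple, and table_env is the environment created in the question's code:
# Hypothetical partitioned variant of the table from the question.
partitioned_ddl = """
    CREATE TABLE test_partitioned (
        a INT,
        b STRING,
        `day` STRING
    ) PARTITIONED BY (`day`) WITH (
        'connector' = 'filesystem',
        'path' = '/opt/data',
        'format' = 'csv'
    )
"""
table_env.execute_sql(partitioned_ddl)

# Partition pruning means only /opt/data/day=2021-11-14 should be scanned here.
table_env.sql_query(
    "SELECT a, b FROM test_partitioned WHERE `day` = '2021-11-14'"
).execute().print()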

Related

NIFI - upload binary.zip to SQL Server as varbinary

I am trying to upload binary.zip to SQL Server as the content of a varbinary column.
Target Table:
CREATE TABLE myTable ( zipFile varbinary(MAX) );
My NIFI Flow is very simple:
-> GetFile:
   filter: binary.zip
-> UpdateAttribute:
   sql.args.1.type = -3      # varbinary, according to the JDBC types enumeration
   sql.args.1.value = ???    # I don't know what to put here! (I've tried everything!)
   sql.args.1.format = ???   # Is it required? I tried 'hex'
-> PutSQL:
   SQL Statement = INSERT INTO myTable (zip_file) VALUES (?);
What should I put in sql.args.1.value?
I think it should be the flowfile payload, but would it work as part of the INSERT in PutSQL? It hasn't so far!
Thanks!
SOLUTION UPDATE:
Based on https://issues.apache.org/jira/browse/NIFI-8052
(Note that I'm sending some data as attribute parameters.)
import java.nio.charset.StandardCharsets
import org.apache.nifi.controller.ControllerService
import groovy.sql.Sql

def flowFile = session.get()
if (!flowFile) return

// look up the DBCP controller service named in the flowfile attributes
def lookup = context.controllerServiceLookup
def dbServiceName = flowFile.getAttribute('DatabaseConnectionPoolName')
def tableName = flowFile.getAttribute('table_name')
def fieldName = flowFile.getAttribute('field_name')
def dbcpServiceId = lookup.getControllerServiceIdentifiers(ControllerService).find {
    cs -> lookup.getControllerServiceName(cs) == dbServiceName
}
def conn = lookup.getControllerService(dbcpServiceId)?.getConnection()
def sql = new Sql(conn)

// stream the flowfile content directly into the varbinary parameter
flowFile.read { rawIn ->
    def parms = [rawIn]
    sql.executeInsert "INSERT INTO " + tableName + " (date, " + fieldName + ") VALUES (CAST(GETDATE() AS DATE), ?)", parms
}

conn?.close()
session.transfer(flowFile, REL_SUCCESS)
session.commit()
Maybe there is a NiFi-native way to insert a blob, but you could use ExecuteGroovyScript instead of UpdateAttribute and PutSQL.
Add an SQL.mydb parameter at the processor level and link it to the required DBCP pool.
Then use the following script body:
def ff = session.get()
if (!ff) return

def statement = "INSERT INTO myTable (zip_file) VALUES (:p_zip_file)"
def params = [
    p_zip_file: SQL.mydb.BLOB(ff.read())  // cast the flowfile content as the BLOB SQL type
]
SQL.mydb.executeInsert(params, statement)  // committed automatically on flowfile success

// transfer to success without changes
REL_SUCCESS << ff
Inside the script, SQL.mydb is a reference to a groovy.sql.Sql object.

Querying nested row with Python in Flink

Based upon the pyflink walkthrough, I'm now trying to get a simple nested row query working using apache-flink==1.14.4. I've created my table structure based upon this solution: Get nested fields from Kafka message using Apache Flink SQL.
A message looks like this:
{"signature": {"token": "abcd1234"}}
The relevant part of the code looks like this:
create_kafka_source_ddl = """
    CREATE TABLE nested_msg (
        `signature` ROW (
            `token` STRING
        )
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'nested_msg',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'nested-msg',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
"""
create_es_sink_ddl = """
    CREATE TABLE es_sink (
        token STRING
    ) WITH (
        'connector' = 'elasticsearch-7',
        'hosts' = 'http://elasticsearch:9200',
        'index' = 'nested_count_1',
        'document-id.key-delimiter' = '$',
        'sink.bulk-flush.max-size' = '42mb',
        'sink.bulk-flush.max-actions' = '32',
        'sink.bulk-flush.interval' = '1000',
        'sink.bulk-flush.backoff.delay' = '1000',
        'format' = 'json'
    )
"""
t_env.execute_sql(create_kafka_source_ddl)
t_env.execute_sql(create_es_sink_ddl)
# How do I select the nested field here?
t_env.from_path("nested_msg").select(col("signature.token").alias("token")).select(
"token"
).execute_insert("es_sink")
I've tried numerous variations here without success. The exception is:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.select.
: org.apache.flink.table.api.ValidationException: Cannot resolve field [signature.token], input field list:[signature].
How can I select a nested field like this in order to insert it into my sink?
You can change col("signature.token") to col("signature").get('token').
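Applied to the snippet from the question, a sketch of the corrected pipeline would look roughly like this (same t_env, table, and sink names as in the question):
from pyflink.table.expressions import col

# Access the nested field with get() instead of a dotted column name,
# then write it straight into the Elasticsearch sink table.
t_env.from_path("nested_msg") \
    .select(col("signature").get("token").alias("token")) \
    .execute_insert("es_sink") \
    .wait()  # wait() blocks until the streaming job terminates; drop it to return immediately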

Flink Table API -> Streaming Sink?

I see examples that convert a Flink Table object to a DataStream and run StreamExecutionEnvironment.execute.
How would I code and run a continuous query that writes to a streaming sink with the Table API, without converting to a DataStream?
It seems this must be possible, because otherwise what would be the purpose of specifying streaming sink table connectors?
The Table API docs list continuous queries and dynamic tables, yet most of the actual Java APIs and code examples seem to use the Table API only for batch.
EDIT: To show David Anderson what I'm trying to do, here are the three Flink SQL CREATE TABLE statements on top of analogous Derby SQL tables.
I see the JDBC table connector sink supports streaming, but am I not configuring this correctly? I don't see anything that I'm overlooking.
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/jdbc.html
FYI, when I get my toy example working, I am planning on using Kafka in production for input/output stream-like data and JDBC/SQL for the lookup table.
CREATE TABLE LookupTableFlink (
    `lookup_key` STRING NOT NULL,
    `lookup_value` STRING NOT NULL,
    PRIMARY KEY (lookup_key) NOT ENFORCED
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:derby:memory:myDB;create=false',
    'table-name' = 'LookupTable'
);

CREATE TABLE IncomingEventsFlink (
    `field_to_use_as_lookup_key` STRING NOT NULL,
    `extra_field` INTEGER NOT NULL,
    `proctime` AS PROCTIME()
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:derby:memory:myDB;create=false',
    'table-name' = 'IncomingEvents'
);

CREATE TABLE TransformedEventsFlink (
    `field_to_use_as_lookup_key` STRING,
    `extra_field` INTEGER,
    `lookup_key` STRING,
    `lookup_value` STRING
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:derby:memory:myDB;create=false',
    'table-name' = 'TransformedEvents'
);
String sqlQuery =
    "SELECT\n" +
    "  IncomingEventsFlink.field_to_use_as_lookup_key, IncomingEventsFlink.extra_field,\n" +
    "  LookupTableFlink.lookup_key, LookupTableFlink.lookup_value\n" +
    "FROM IncomingEventsFlink\n" +
    "LEFT JOIN LookupTableFlink FOR SYSTEM_TIME AS OF IncomingEventsFlink.proctime\n" +
    "ON (IncomingEventsFlink.field_to_use_as_lookup_key = LookupTableFlink.lookup_key)\n";

Table joinQuery = tableEnv.sqlQuery(sqlQuery);

// This seems to run, return, and complete, and doesn't seem to run in continuous/streaming mode.
TableResult tableResult = joinQuery.executeInsert("TransformedEventsFlink");
You can write to a dynamic table by using executeInsert, as in
Table orders = tableEnv.from("Orders");
orders.executeInsert("OutOrders");
The documentation is here.
It's explained here.
A code example can be found here:
// get a StreamTableEnvironment
StreamTableEnvironment tableEnv = ...; // see "Create a TableEnvironment" section

// Table with two fields (String name, Integer age)
Table table = ...

// convert the Table into an append DataStream of Row by specifying the class
DataStream<Row> dsRow = tableEnv.toAppendStream(table, Row.class);

// convert the Table into an append DataStream of Tuple2<String, Integer>
// via a TypeInformation
TupleTypeInfo<Tuple2<String, Integer>> tupleType = new TupleTypeInfo<>(
    Types.STRING(),
    Types.INT());
DataStream<Tuple2<String, Integer>> dsTuple =
    tableEnv.toAppendStream(table, tupleType);

// convert the Table into a retract DataStream of Row.
// A retract stream of type X is a DataStream<Tuple2<Boolean, X>>.
// The boolean field indicates the type of the change.
// True is INSERT, false is DELETE.
DataStream<Tuple2<Boolean, Row>> retractStream =
    tableEnv.toRetractStream(table, Row.class);
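Since most of this page uses PyFlink, here is a rough Python sketch of the executeInsert pattern above, writing a continuous query to a sink table without any DataStream conversion. The "Orders" and "OutOrders" names come from the answer's Java snippet and are assumed to have been registered beforehand with CREATE TABLE DDL:
from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming Table API job: the query is written to the sink table directly,
# with no conversion to a DataStream.
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
t_env = TableEnvironment.create(env_settings)

# Assumes "Orders" (source) and "OutOrders" (sink) were created via CREATE TABLE DDL.
orders = t_env.from_path("Orders")

# execute_insert submits a continuous job and returns a TableResult immediately.
result = orders.execute_insert("OutOrders")
# result.wait()  # uncomment to block; with an unbounded source this runs indefinitely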

String delimiter present in string not permitted in Polybase?

I'm creating an external table using a CSV stored in Azure Data Lake Storage and populating the table using PolyBase in SQL Server.
However, I ran into the error below, which I figured may be because one particular column contains double quotes within the string, while the string delimiter in PolyBase has been specified as " (STRING_DELIMITER = '"').
HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopExecutionException: Could not find a delimiter after string delimiter
I have done quite extensive research on this and found that the issue has been around for years, but I have yet to see any solution offered.
Any help will be appreciated.
I think the easiest way to fix this, since you are in charge of creating the .csv, is to use a delimiter which is not a comma and leave off the string delimiter. Use a separator which you know will not appear in the file. I've used a pipe in my example, and I clean up the string once it is imported into the database.
A simple example:
IF EXISTS ( SELECT * FROM sys.external_tables WHERE name = 'delimiterWorking' )
    DROP EXTERNAL TABLE delimiterWorking
GO

IF EXISTS ( SELECT * FROM sys.tables WHERE name = 'cleanedData' )
    DROP TABLE cleanedData
GO

IF EXISTS ( SELECT * FROM sys.external_file_formats WHERE name = 'ff_delimiterWorking' )
    DROP EXTERNAL FILE FORMAT ff_delimiterWorking
GO

CREATE EXTERNAL FILE FORMAT ff_delimiterWorking
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        --STRING_DELIMITER = '"',
        FIRST_ROW = 2,
        ENCODING = 'UTF8'
    )
);
GO

CREATE EXTERNAL TABLE delimiterWorking (
    id INT NOT NULL,
    body VARCHAR(8000) NULL
)
WITH (
    LOCATION = 'yourLake/someFolder/delimiterTest6.txt',
    DATA_SOURCE = ds_azureDataLakeStore,
    FILE_FORMAT = ff_delimiterWorking,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);
GO

SELECT *
FROM delimiterWorking
GO

-- Fix up the data
CREATE TABLE cleanedData
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = ROUND_ROBIN
)
AS
SELECT
    id,
    body AS originalCol,
    SUBSTRING ( body, 2, LEN(body) - 2 ) cleanBody
FROM delimiterWorking
GO

SELECT *
FROM cleanedData
The string delimiter issue can be avoided if you convert the data lake flat file to Parquet format.
Input:
"ID","NAME","COMMENTS"
"1","DAVE","Hi "I am Dave" from"
"2","AARO","AARO"
Steps:
1. Convert the flat file to Parquet format [using Azure Data Factory]
2. Create an external file format in the data lake [assuming the master key and scoped credentials are available]
CREATE EXTERNAL FILE FORMAT PARQUET_CONV
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
3. Create the external table with FILE_FORMAT = PARQUET_CONV
I believe this is the best option, as Microsoft currently doesn't have a solution for handling string delimiters occurring within the data for external tables.

Apache Flink 1.11 Streaming Sink to S3

I'm using the Flink FileSystem SQL connector to read events from Kafka and write to S3 (using MinIO). Here is my code:
# imports assumed for completeness (not shown in the original snippet)
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import FsStateBackend
from pyflink.table import StreamTableEnvironment, TableConfig

exec_env = StreamExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
# start a checkpoint every 10 s
exec_env.enable_checkpointing(10000)
exec_env.set_state_backend(FsStateBackend("s3://test-bucket/checkpoints/"))
t_config = TableConfig()
t_env = StreamTableEnvironment.create(exec_env, t_config)

INPUT_TABLE = "source"
INPUT_TOPIC = "Rides"
LOCAL_KAFKA = 'kafka:9092'
OUTPUT_TABLE = "sink"

ddl_source = f"""
    CREATE TABLE {INPUT_TABLE} (
        `rideId` BIGINT,
        `isStart` BOOLEAN,
        `eventTime` STRING,
        `lon` FLOAT,
        `lat` FLOAT,
        `psgCnt` INTEGER,
        `taxiId` BIGINT
    ) WITH (
        'connector' = 'kafka',
        'topic' = '{INPUT_TOPIC}',
        'properties.bootstrap.servers' = '{LOCAL_KAFKA}',
        'format' = 'json'
    )
"""

ddl_sink = f"""
    CREATE TABLE {OUTPUT_TABLE} (
        `rideId` BIGINT,
        `isStart` BOOLEAN,
        `eventTime` STRING,
        `lon` FLOAT,
        `lat` FLOAT,
        `psgCnt` INTEGER,
        `taxiId` BIGINT
    ) WITH (
        'connector' = 'filesystem',
        'path' = 's3://test-bucket/kafka_output',
        'format' = 'parquet'
    )
"""

t_env.sql_update(ddl_source)
t_env.sql_update(ddl_sink)

t_env.execute_sql(f"""
    INSERT INTO {OUTPUT_TABLE}
    SELECT *
    FROM {INPUT_TABLE}
""")
I'm using Flink 1.11.3 and flink-s3-fs-hadoop 1.11.3. I have copied the flink-s3-fs-hadoop-1.11.3.jar into the plugins folder.
cp /opt/flink/lib/flink-s3-fs-hadoop-1.11.3.jar /opt/flink/plugins/s3-fs-hadoop/;
Also, I have added the following configs to flink-conf.yaml:
state.backend: filesystem
state.checkpoints.dir: s3://test-bucket/checkpoints/
s3.endpoint: http://127.0.0.1:9000
s3.path.style.access: true
s3.access-key: minio
s3.secret-key: minio123
MinIO is running properly and I have created the 'test-bucket' in MinIO. When I run this job, the job submission doesn't happen and the Flink Dashboard goes into some sort of waiting state. After 15-20 minutes I get the following exception:
pyflink.util.exceptions.TableException: Failed to execute sql
What seems to be the problem here?
