I am converting some legacy Java code written for Flink 1.5 to Flink 1.13.1. Specifically, I'm working with the Table API. I have to read data from a CSV file, perform some basic SQL, and then write the results back to a file.
For Flink 1.5, I used the following code to perform the above actions:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
TableSource tableSrc = CsvTableSource.builder()
.path("<CSV_PATH>")
.fieldDelimiter(",")
.field("date", Types.STRING)
.field("month", Types.STRING)
...
.build();
tableEnv.registerTableSource("CatalogTable", tableSrc);
String sql = "...";
Table result = tableEnv.sqlQuery(sql);
DataSet<Row1> resultSet = tableEnv.toDataSet(result, Row1.class);
resultSet.writeAsText("<OUT_PATH>");
env.execute("Flink Table-Sql Example");
To convert the above code to Flink 1.13.1, I wrote the following:
import org.apache.flink.table.api.Table;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.BatchTableEnvironment;
EnvironmentSettings settings = EnvironmentSettings
.newInstance()
.inBatchMode()
.build();
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
TableEnvironment tableEnv = TableEnvironment.create(settings);
final String tableDDL = "CREATE TEMPORARY TABLE CatalogTable (" +
"date STRING, " +
"month STRING, " +
"..." +
") WITH (" +
"'connector' = 'filesystem', " +
"'path' = 'file:///CSV_PATH', " +
"'format' = 'csv'" +
")";
tableEnv.executeSql(tableDDL);
String sql = "...";
Table result = tableEnv.sqlQuery(sql);
// DEPRECATED - BatchTableEnvironment required to convert Table to DataSet
BatchTableEnvironment bTableEnv = BatchTableEnvironment.create(env);
DataSet<Row1> resultSet = bTableEnv.toDataSet(result, Row1.class);
resultSet.writeAsText("<OUT_PATH>");
env.execute("Flink Table-Sql Example");
However, BatchTableEnvironment is marked as deprecated in Flink 1.13. Is there any alternative for converting a Table to a DataSet, or for writing a Table directly to a file?
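One direction I have seen hinted at in the 1.13 docs, but have not verified, is declaring a filesystem sink table via DDL and writing the query result with Table#executeInsert, avoiding the DataSet conversion entirely. A rough sketch of what I mean (the column list, output path and format are placeholders, not my actual job):
// Sink table declared through DDL, in the same style as the source table above
tableEnv.executeSql(
    "CREATE TEMPORARY TABLE OutputTable (" +
    "  `date` STRING, " +       // backticks because date/month clash with reserved keywords
    "  `month` STRING " +
    ") WITH (" +
    "  'connector' = 'filesystem', " +
    "  'path' = 'file:///OUT_PATH', " +
    "  'format' = 'csv'" +
    ")");

Table result = tableEnv.sqlQuery(sql);

// executeInsert() submits the job itself, so no env.execute() call is needed;
// await() blocks until the batch write has finished (exception handling omitted)
result.executeInsert("OutputTable").await();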
My understanding is that you cannot do a full Snowflake data dump in one step and instead need to use the COPY command to unload data from a table into an internal (i.e. Snowflake-managed) stage; a sketch of that COPY approach is included below, after my current script.
To automate the process, I thought I would do it with Python. Do you think that is the best approach?
import traceback
import snowflake.connector
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
url = URL(
user='??????',
password='????????',
account='??????-??????',
database='SNOWFLAKE',
role = 'ACCOUNTADMIN'
)
out_put_string = ""
try:
engine = create_engine(url)
connection = engine.connect()
# Get all the views from the SNOWFLAKE database
query = '''
show views in database SNOWFLAKE
'''
df = pd.read_sql(query, connection)
# Loop over all the views
df = df.reset_index() # make sure indexes pair with number of rows
for index, row in df.iterrows():
out_put_string += "VIEW:----------" + row['schema_name'] + "." + row['name'] + "----------\n"
df_view = pd.read_sql('select * from ' + row['schema_name'] + "." + row['name'], connection)
df_view.to_csv("/Temp/Output_CVS/" + row['schema_name'] + "-" + row['name'] + ".csv")
out_put_string += df_view.to_string() + "\n"
except:
print("ERROR:")
traceback.print_exc()
connection.close()
#Export all the Views in one file
text_file = open("/Temp/Output_CVS/AllViewsData.txt", "w")
text_file.write(out_put_string)
text_file.close()
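Regarding the COPY-based unload I mentioned at the top, this is a rough sketch of driving it from Python with snowflake.connector. The stage path, table name and file format here are placeholders I made up, not part of my actual setup:
import snowflake.connector

conn = snowflake.connector.connect(
    user='??????',
    password='????????',
    account='??????-??????',
    role='ACCOUNTADMIN',
)
cur = conn.cursor()
try:
    # Unload one table into the user's internal stage as gzipped CSV files
    cur.execute("""
        COPY INTO @~/unload/my_table/
        FROM my_schema.my_table
        FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' COMPRESSION = GZIP)
        OVERWRITE = TRUE
    """)
    # Download the staged files to a local directory
    cur.execute("GET @~/unload/my_table/ file:///Temp/Output_CVS/")
finally:
    cur.close()
    conn.close()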
I have a scenario where I define a Kafka source, a UDF/UDTF for processing, and a Kafka sink. No matter what I do, when I run the job the output is flooded with the processed output of a single input record. For illustrative purposes, the output on the defined Kafka sink topic shows distinct timestamps (so the UDF is entered for each respective input record), but the payload is the same input record every time.
While trying to figure out the problem, I've read through whatever Flink documentation I could find (and the rabbit hole of links from it) on enforcing 'semantic EXACTLY ONCE' processing of records. As far as I can gather, it comes down to the following settings (a sketch of how I read the corresponding connector options follows this list):
This video gave me the best visual representation for understanding it: semantic_once_video
Kafka source (consumer)
kafka source property of isolation level = read_committed
Kafka sinks (producer)
Kafka sink property of processing mode = exactly_once
Kafka sink property of idempotence = true
Utilizing checkpointing
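As mentioned above, this is how I currently read those settings as Flink 1.14 Kafka SQL connector options. This is an assumption on my part based on the connector documentation, not something I have verified end to end (topic and server values are placeholders):
# Sink side: a delivery guarantee plus a transactional id prefix, rather than
# raw 'properties.*' keys (my assumption from the 1.14 connector docs)
sink_ddl_sketch = """
    CREATE TABLE sink_table_sketch (
        entry STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'sink_topic',
        'properties.bootstrap.servers' = 'localhost:9094',
        'sink.delivery-guarantee' = 'exactly-once',
        'sink.transactional-id-prefix' = 'my-job-tx',
        'format' = 'raw'
    )
"""
# Source side: the consumer isolation level is passed through as a Kafka property:
#   'properties.isolation.level' = 'read_committed'
# plus checkpointing enabled on the StreamExecutionEnvironment, as in the code below.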
I've also gone through the Stack Overflow questions I could find on the topic (mainly discussing Java implementations)... needless to say, it's still not resolved. Here's my code for reference:
import os
from pyflink.datastream.stream_execution_environment import StreamExecutionEnvironment
from pyflink.table import TableEnvironment, EnvironmentSettings, DataTypes, StreamTableEnvironment
from pyflink.table.udf import ScalarFunction, TableFunction, udf, udtf
from pyflink.datastream.checkpointing_mode import CheckpointingMode
KAFKA_SERVERS = os.getenv('KAFKA_BS_SERVERS',"localhost:9094").split(',')
KAFKA_USERNAME = "xxx"
KAFKA_PASSWORD = "_pass_"
KAFKA_SOURCE_TOPIC = 'source_topic'
KAFKA_SINK_TOPIC = 'sink_topic'
KAFKA_GROUP_ID = 'testgroup12'
JAR_DEPENDENCIES = os.getenv('JAR_DEPENDENCIES', '/opt/flink/lib_py')
class tbl_function(TableFunction):
def open(self, function_context):
pass
def eval(self, *args):
import json
from datetime import datetime
res = {
'time': str(datetime.utcnow()),
'input': json.loads(args[0])
}
yield json.dumps(res)
def pipeline():
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
for file in os.listdir(JAR_DEPENDENCIES):
if file.find('.jar') != -1:
env.add_jars(f"file://{JAR_DEPENDENCIES}/{file}")
print(f"added jar dep: {JAR_DEPENDENCIES}/{file}")
env.enable_checkpointing(60000, CheckpointingMode.EXACTLY_ONCE)
env.get_checkpoint_config().set_min_pause_between_checkpoints(120000)
env.get_checkpoint_config().enable_unaligned_checkpoints()
env.get_checkpoint_config().set_checkpoint_interval(30000)
settings = EnvironmentSettings.new_instance()\
.in_streaming_mode()\
.use_blink_planner()\
.build()
t_env = StreamTableEnvironment.create(stream_execution_environment= env, environment_settings=settings)
source_ddl = f"""
CREATE TABLE source_table(
entry STRING
) WITH (
'connector' = 'kafka',
'topic' = '{KAFKA_SOURCE_TOPIC}',
'properties.bootstrap.servers' = '{','.join(KAFKA_SERVERS)}',
'properties.isolation_level' = 'read_committed',
'properties.group.id' = '{KAFKA_GROUP_ID}',
'properties.sasl.mechanism' = 'PLAIN',
'properties.security.protocol' = 'SASL_PLAINTEXT',
'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{KAFKA_USERNAME}\" password=\"{KAFKA_PASSWORD}\";',
'scan.startup.mode' = 'earliest-offset',
'format' = 'raw'
)
"""
sink_ddl = f"""
CREATE TABLE sink_table(
entry STRING
) WITH (
'connector' = 'kafka',
'topic' = '{KAFKA_SINK_TOPIC}',
'properties.bootstrap.servers' = '{','.join(KAFKA_SERVERS)}',
'properties.group.id' = '{KAFKA_GROUP_ID}',
'properties.processing.mode' = 'exactly_once',
'properties.enable.idempotence' = 'true',
'properties.sasl.mechanism' = 'PLAIN',
'properties.security.protocol' = 'SASL_PLAINTEXT',
'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{KAFKA_USERNAME}\" password=\"{KAFKA_PASSWORD}\";',
'format' = 'raw'
)
"""
t_env.execute_sql(source_ddl).wait()
t_env.execute_sql(sink_ddl).wait()
f = tbl_function()
table_udf = udtf(f, result_types=[DataTypes.STRING()])
t_env.create_temporary_function("table_f", table_udf)
table = t_env.from_path('source_table')
table = table.join_lateral('table_f(entry) as (content)')
table = table.select('content').alias('entry')
table.insert_into('sink_table')
from datetime import datetime
t_env.execute(f"dummy_test_{str(datetime.now())}")
if __name__ == '__main__':
pipeline()
JAR dependencies:
added jar dep: /opt/flink/lib_py/flink-sql-connector-kafka_2.12-1.14.2.jar
added jar dep: /opt/flink/lib_py/flink-connector-kafka_2.12-1.14.2.jar
added jar dep: /opt/flink/lib_py/kafka-clients-2.4.1.jar
After a whole bunch of trial and error, and still not knowing precisely why this resolved the issue (or whether it is an underlying PyFlink issue), I found that if you add a table metadata field to your source definition, the pipeline somehow produces an exactly-once data flow (1 record in = 1 record out, no duplicates).
The only change I had to make is one line of metadata in the source DDL definition (again providing my full script for reference):
import os
from pyflink.datastream.stream_execution_environment import StreamExecutionEnvironment
from pyflink.table import TableEnvironment, EnvironmentSettings, DataTypes, StreamTableEnvironment
from pyflink.table.udf import ScalarFunction, TableFunction, udf, udtf
from pyflink.datastream.checkpointing_mode import CheckpointingMode
KAFKA_SERVERS = os.getenv('KAFKA_BS_SERVERS',"localhost:9094").split(',')
KAFKA_USERNAME = "xxx"
KAFKA_PASSWORD = "_pass_"
KAFKA_SOURCE_TOPIC = 'source_topic'
KAFKA_SINK_TOPIC = 'sink_topic'
KAFKA_GROUP_ID = 'testgroup12'
JAR_DEPENDENCIES = os.getenv('JAR_DEPENDENCIES', '/opt/flink/lib_py')
class tbl_function(TableFunction):
def open(self, function_context):
pass
def eval(self, *args):
import json
from datetime import datetime
res = {
'time': str(datetime.utcnow()),
'input': json.loads(args[0])
}
yield json.dumps(res)
def pipeline():
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
for file in os.listdir(JAR_DEPENDENCIES):
if file.find('.jar') != -1:
env.add_jars(f"file://{JAR_DEPENDENCIES}/{file}")
print(f"added jar dep: {JAR_DEPENDENCIES}/{file}")
env.enable_checkpointing(60000, CheckpointingMode.EXACTLY_ONCE)
env.get_checkpoint_config().set_min_pause_between_checkpoints(120000)
env.get_checkpoint_config().enable_unaligned_checkpoints()
env.get_checkpoint_config().set_checkpoint_interval(30000)
settings = EnvironmentSettings.new_instance()\
.in_streaming_mode()\
.use_blink_planner()\
.build()
t_env = StreamTableEnvironment.create(stream_execution_environment= env, environment_settings=settings)
# this sneaky bugger line -> with 'event_time'
source_ddl = f"""
CREATE TABLE source_table(
entry STRING,
event_time TIMESTAMP(3) METADATA FROM 'timestamp'
) WITH (
'connector' = 'kafka',
'topic' = '{KAFKA_SOURCE_TOPIC}',
'properties.bootstrap.servers' = '{','.join(KAFKA_SERVERS)}',
'properties.isolation_level' = 'read_committed',
'properties.group.id' = '{KAFKA_GROUP_ID}',
'properties.sasl.mechanism' = 'PLAIN',
'properties.security.protocol' = 'SASL_PLAINTEXT',
'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{KAFKA_USERNAME}\" password=\"{KAFKA_PASSWORD}\";',
'scan.startup.mode' = 'earliest-offset',
'format' = 'raw'
)
"""
sink_ddl = f"""
CREATE TABLE sink_table(
entry STRING
) WITH (
'connector' = 'kafka',
'topic' = '{KAFKA_SINK_TOPIC}',
'properties.bootstrap.servers' = '{','.join(KAFKA_SERVERS)}',
'properties.group.id' = '{KAFKA_GROUP_ID}',
'properties.processing.mode' = 'exactly_once',
'properties.enable.idempotence' = 'true',
'properties.sasl.mechanism' = 'PLAIN',
'properties.security.protocol' = 'SASL_PLAINTEXT',
'properties.sasl.jaas.config' = 'org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{KAFKA_USERNAME}\" password=\"{KAFKA_PASSWORD}\";',
'format' = 'raw'
)
"""
t_env.execute_sql(source_ddl).wait()
t_env.execute_sql(sink_ddl).wait()
f = tbl_function()
table_udf = udtf(f, result_types=[DataTypes.STRING()])
t_env.create_temporary_function("table_f", table_udf)
table = t_env.from_path('source_table')
table = table.join_lateral('table_f(entry) as (content)')
table = table.select('content').alias('entry')
table.insert_into('sink_table')
from datetime import datetime
t_env.execute(f"dummy_test_{str(datetime.now())}")
if __name__ == '__main__':
pipeline()
Hope this saves someone time, unlike the 3 days I spent on trial and error. #Sigh :(
I want to run SQL expressions locally over data streams from Kafka, Kinesis, etc.
I've tried running the following code, which creates a DataStream source from Kafka, registers it as a table, runs a select * on it, and gets the result back as a DataStream. CollectSink is just a utility sink I use so that I can debug those messages.
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.createLocalEnvironment()
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, EnvironmentSettings.inStreamingMode())
val source = KafkaSource.builder<String>()
.setBootstrapServers(defaultKafkaProperties["bootstrap.servers"] as String)
.setTopics(topicName)
.setGroupId("my-group-test" + Math.random())
.setStartingOffsets(OffsetsInitializer.earliest())
.setValueOnlyDeserializer(SimpleStringSchema())
.build()
val stream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")
val table = tableEnv.fromDataStream(stream)
tableEnv.createTemporaryView("Source", table);
val resultTable = tableEnv.sqlQuery("select * from Source")
val resultStream = tableEnv.toDataStream(resultTable)
resultStream.addSink(CollectSink())
env.execute()
I always get the following error, and I don't know why. I'm not using Scala in my application code, so I assume I'm missing a dependency somewhere?
java.lang.NoClassDefFoundError: scala/Serializable
at java.base/java.lang.ClassLoader.defineClass1(Native Method)
at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:151)
at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:821)
at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:719)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:642)
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:600)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
at org.apache.flink.table.planner.expressions.PlannerTypeInferenceUtilImpl.<clinit>(PlannerTypeInferenceUtilImpl.java:51)
at org.apache.flink.table.planner.delegation.PlannerBase.<init>(PlannerBase.scala:92)
at org.apache.flink.table.planner.delegation.StreamPlanner.<init>(StreamPlanner.scala:52)
at org.apache.flink.table.planner.delegation.DefaultPlannerFactory.create(DefaultPlannerFactory.java:61)
at org.apache.flink.table.factories.PlannerFactoryUtil.createPlanner(PlannerFactoryUtil.java:50)
at org.apache.flink.table.api.bridge.java.internal.StreamTableEnvironmentImpl.create(StreamTableEnvironmentImpl.java:151)
at org.apache.flink.table.api.bridge.java.StreamTableEnvironment.create(StreamTableEnvironment.java:128)
These are the dependencies I'm using:
val flinkVersion = "1.14.3"
val flinkScalaVersion = 2.12
implementation("org.apache.flink:flink-clients_$flinkScalaVersion:$flinkVersion")
implementation("org.apache.flink:flink-table-api-java-bridge_$flinkScalaVersion:$flinkVersion")
implementation("org.apache.flink:flink-table-planner_$flinkScalaVersion:$flinkVersion")
implementation("org.apache.flink:flink-streaming-scala_$flinkScalaVersion:$flinkVersion")
implementation("org.apache.flink:flink-table-common:$flinkVersion")
implementation("org.apache.flink:flink-connector-kafka_$flinkScalaVersion:$flinkVersion")
implementation("org.apache.flink:flink-avro-confluent-registry:$flinkVersion")
implementation("org.apache.flink:flink-avro:$flinkVersion")
I am trying to use the asyncio package to execute concurrent calls from one SQL Server to another in order to extract data. I'm hitting an issue at myLoop.run_until_complete(cors), where it tells me that the event loop is already running. I will admit that I am new to this package and may be overlooking something simple.
import pyodbc
import sqlalchemy
import pandas
import asyncio
import time
async def getEngine(startString):
sourceList = str.split(startString,'=')
server = str.split(sourceList[1],';')[0]
database = str.split(sourceList[2],';')[0]
user = str.split(sourceList[3],';')[0]
password = str.split(sourceList[4],';')[0]
returnEngine = sqlalchemy.create_engine("mssql+pyodbc://"+user+":"+password+"#"+server+"/"+database+"?driver=SQL+Server+Native+Client+11.0")
return returnEngine
async def getConnString(startString):
sourceList = str.split(startString,'=')
server = str.split(sourceList[1],';')[0]
database = str.split(sourceList[2],';')[0]
user = str.split(sourceList[3],';')[0]
password = str.split(sourceList[4],';')[0]
return "Driver={SQL Server Native Client 11.0};Server="+server+";Database="+database+";Uid="+user+";Pwd="+password+";"
async def executePackage(source,destination,query,sourceTable,destTable,lastmodifiedDate,basedOnStation):
sourceConnString = getConnString(source)
destEngine = getEngine(destination)
sourceConn = pyodbc.connect(sourceConnString)
newQuery = str.replace(query,'dateTest',str(lastmodifiedDate))
df = pandas.read_sql(newQuery,sourceConn)
print('Started '+sourceTable+'->'+destTable)
tic = time.perf_counter()
await df.to_sql(destTable,destEngine,index=False,if_exists="append")
toc = time.perf_counter()
secondsToFinish = toc - tic
print('Finished '+sourceTable+'->'+destTable+' in '+ str(secondsToFinish) +' seconds')
async def main():
connString = "Driver={SQL Server Native Client 11.0};Server=myServer;Trusted_Connection=yes;"
myConn = pyodbc.connect(connString)
cursor = myConn.cursor()
df = pandas.read_sql('exec mySql_stored_proc',myConn)
if len(df.index) > 0:
tasks = [executePackage(df.iloc[i,10],df.iloc[i,11],df.iloc[i,7],df.iloc[i,8],df.iloc[i,9],df.iloc[i,5],df.iloc[i,17])for i in range(len(df))]
myLoop = asyncio.get_event_loop()
cors = asyncio.wait(tasks)
myLoop.run_until_complete(cors)
if __name__ =="__main__":
asyncio.run(main())
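For what it's worth, my current (possibly wrong) understanding is that asyncio.run(main()) already starts and runs an event loop, so calling myLoop.run_until_complete(cors) from inside main() tries to re-enter that running loop, which raises the error. Awaiting the tasks directly would look roughly like this (reusing executePackage from my code above):
import asyncio
import pandas
import pyodbc

async def main():
    connString = "Driver={SQL Server Native Client 11.0};Server=myServer;Trusted_Connection=yes;"
    myConn = pyodbc.connect(connString)
    df = pandas.read_sql('exec mySql_stored_proc', myConn)
    if len(df.index) > 0:
        tasks = [
            executePackage(df.iloc[i, 10], df.iloc[i, 11], df.iloc[i, 7], df.iloc[i, 8],
                           df.iloc[i, 9], df.iloc[i, 5], df.iloc[i, 17])
            for i in range(len(df))
        ]
        # The loop is already running inside asyncio.run(), so await the tasks
        # instead of calling run_until_complete() on the same loop.
        await asyncio.gather(*tasks)
        # Note: pyodbc and pandas calls block the loop, so real concurrency would
        # also need something like asyncio.to_thread() / run_in_executor around them.

if __name__ == "__main__":
    asyncio.run(main())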
I am using a Glue job to write a data pipeline. I took the code from the community, which is as follows:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from py4j.java_gateway import java_import
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
#args = getResolvedOptions(sys.argv, ['JOB_NAME'])
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'ACCOUNT', 'WAREHOUSE', 'DB', 'SCHEMA', 'USERNAME', 'PASSWORD', 'ROLE'])
sparkContext = SparkContext()
glueContext = GlueContext(sparkContext)
sparkSession = glueContext.spark_session
glueJob = Job(glueContext)
glueJob.init(args['JOB_NAME'], args)
##Use the CData JDBC driver to read Snowflake data from the Products table into a DataFrame
##Note the populated JDBC URL and driver class name
java_import(sparkSession._jvm, SNOWFLAKE_SOURCE_NAME)
sparkSession._jvm.net.snowflake.spark.snowflake.SnowflakeConnectorUtils.enablePushdownSession(sparkSession._jvm.org.apache.spark.sql.SparkSession.builder().getOrCreate())
tmp_dir=args["TempDir"]
sfOptions = {
"sfURL" : args['URL'],
"sfAccount" : args['ACCOUNT'],
"sfUser" : args['USERNAME'],
"sfPassword" : args['PASSWORD'],
"sfDatabase" : args['DB'],
"sfSchema" : args['SCHEMA'],
"sfRole" : args['ROLE'],
"sfWarehouse" : args['WAREHOUSE'],
"preactions" : "USE DATABASE dev_lz;",
}
#"tempDir" : tmp_dir,
print('=========DB Connection details ================== ', sfOptions)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "aws-nonprod-datalake-glue-catalog", table_name = "nm_s_amaster", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [ mappings], transformation_ctx = "applymapping1")
selectfields2 = SelectFields.apply(frame = applymapping1, paths = [columns], transformation_ctx = "selectfields2")
resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "aws-nonprod-datalake-glue-catalog", table_name = "NM_TEMP", transformation_ctx = "resolvechoice3")
resolvechoice4 = ResolveChoice.apply(frame = resolvechoice3, choice = "make_cols", transformation_ctx = "resolvechoice4")
##Convert DataFrames to AWS Glue's DynamicFrames Object
resolvechoice4.toDF().write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("preactions","USE DATABASE dev_lz").option("dbtable", "nm_temp").mode("overwrite").save()
glueJob.commit()
But after running the code I am getting:
net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: Table 'NM_TEMP_STAGING_1100952600' does not exist
Please let me know if I am missing anything.
I have permissions to create and select stages, create and select tables, and create future tables.
In the code above I have removed the columns and mappings, but they are present in the original code.
resolvechoice4.toDF().write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("preactions","USE DATABASE dev_lz").option("dbtable", "nm_temp").mode("overwrite").save()
Adding the following option to the write above (alongside the dbtable option) made it start working:
.option("preactions","USE ROLE DEVELOPER;USE DATABASE dev_db;USE SCHEMA aws_test")
as follows:
resolvechoice4.toDF().write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("preactions","USE DATABASE dev_lz").option("preactions","USE ROLE DEVELOPER;USE DATABASE dev_db;USE SCHEMA aws_test").option("dbtable", "nm_temp").mode("overwrite").save()
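Side note on that last line: DataFrameWriter.option() keeps a single value per key, so the second "preactions" call replaces the first one and only the ROLE/DATABASE/SCHEMA statements actually run. Keeping a single option call makes that explicit, e.g.:
# Equivalent writer call with one effective preactions option (the earlier
# "USE DATABASE dev_lz" value is overridden anyway when the key repeats)
resolvechoice4.toDF().write \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("preactions", "USE ROLE DEVELOPER;USE DATABASE dev_db;USE SCHEMA aws_test") \
    .option("dbtable", "nm_temp") \
    .mode("overwrite") \
    .save()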