Pyspark on EMR with JDBC task hangs forever - sql-server

I'm trying to execute a simple PySpark job on EMR (emr-6.2.0, Spark 3.0.1) as follows:
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName("Spark Downloader")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.sql.streaming.schemaInference", "true")
    .config("maximizeResourceAllocation", "true")
    .getOrCreate())
spark.sparkContext.setLogLevel("WARN")
jdbcDf = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://sqlserver:1433;databaseName=dw")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", "db.dbo.My200GBTable")
    .option("user", "xxx")
    .option("password", "xxx")
    .option("partitionColumn", "SOMEDATEFIELD")
    .option("lowerBound", "2019-01-01")
    .option("upperBound", "2022-05-26")
    .option("numPartitions", 250)
    .load())
jdbcDf.write.mode("overwrite").partitionBy("SOMEDATEFIELD").parquet("hdfs:///output/tables/My200GBTable/")
spark.stop()
Submitting with:
"Name": "1. Spark Download My200GBTable to HDFS",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"spark-submit",
"--deploy-mode",
"cluster",
"--master",
"yarn",
"--conf",
"spark.jars.packages=com.microsoft.azure:spark-mssql-connector_2.12:1.2.0",
"--conf",
"spark.yarn.submit.waitAppCompletion=true",
"s3a://pyspark_apps/download_table_to_hdfs.py"
In another step I will use s3-dist-cp to copy the output to an S3 bucket before the cluster stops.
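For reference, a rough sketch of what that copy step could look like, in the same command-runner.jar step format as above (the step name, bucket and destination path are placeholders, not my real values):
{
  "Name": "2. Copy My200GBTable from HDFS to S3",
  "ActionOnFailure": "CONTINUE",
  "HadoopJarStep": {
    "Jar": "command-runner.jar",
    "Args": [
      "s3-dist-cp",
      "--src=hdfs:///output/tables/My200GBTable/",
      "--dest=s3://my-placeholder-bucket/tables/My200GBTable/"
    ]
  }
}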
The Spark job itself works properly: it connects and generates all the output, down to the _SUCCESS file. However, there is always one remaining task that prevents the job from finishing. The YARN timeline looks like this:
The partitions are not skewed and the output files are evenly sized. Am I missing something obvious?

Related

Zeppelin Python Flink cannot print to console

I'm using Kinesis Data Analytics Studio which provides a Zeppelin environment.
Very simple code:
%flink.pyflink
from pyflink.common.serialization import JsonRowDeserializationSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer

# create env: use Zeppelin's s_env when running remotely in the notebook, otherwise create a local environment
env = s_env or StreamExecutionEnvironment.get_execution_environment()
env.add_jars("file:///home/ec2-user/flink-sql-connector-kafka_2.12-1.13.5.jar")

# create a kafka consumer
deserialization_schema = JsonRowDeserializationSchema.builder() \
    .type_info(type_info=Types.ROW_NAMED(
        ['id', 'name'],
        [Types.INT(), Types.STRING()])
    ).build()

kafka_consumer = FlinkKafkaConsumer(
    topics='nihao',
    deserialization_schema=deserialization_schema,
    properties={
        'bootstrap.servers': 'kakfa-brokers:9092',
        'group.id': 'group1'
    })
kafka_consumer.set_start_from_earliest()

ds = env.add_source(kafka_consumer)
ds.print()
env.execute('job1')
I can get this working locally and can see the change logs being produced to the console. However, I cannot get the same results in Zeppelin.
I also checked STDOUT on the task managers in the Flink web console; there is nothing there either.
Am I missing something? I've searched for days and could not find anything on this.
I'm not 100% sure, but I think you may need a sink to begin pulling data through the datastream; you could potentially use the included Print Sink Function.
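Here is a minimal sketch of that idea, continuing the snippet from the question (env, ds and the broker address come from there; the output topic name nihao-out is just a placeholder): attach an explicit sink that writes the stream back out to Kafka, so the records can be checked outside the notebook.
%flink.pyflink
from pyflink.common.serialization import JsonRowSerializationSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream.connectors import FlinkKafkaProducer

# serialize the same (id, name) rows back to JSON
serialization_schema = JsonRowSerializationSchema.builder() \
    .with_type_info(type_info=Types.ROW_NAMED(
        ['id', 'name'],
        [Types.INT(), Types.STRING()])
    ).build()

# explicit sink: write the stream to an output topic ("nihao-out" is a placeholder)
kafka_producer = FlinkKafkaProducer(
    topic='nihao-out',
    serialization_schema=serialization_schema,
    producer_config={'bootstrap.servers': 'kakfa-brokers:9092'})

# ds and env are the datastream and environment from the question's snippet
ds.add_sink(kafka_producer)
env.execute('job1')
If records show up on that topic, data is flowing through the pipeline and the remaining question is only where the ds.print() output ends up.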

How to use iteration data with newman

I have a collection in Postman which loads "payload objects" from a JSON file, and I want to run it in newman from the command line.
POST request
Body: the body of the POST request is {{jsonBody}}
Pre-request Script: pm.globals.set("jsonBody", JSON.stringify(pm.iterationData.toObject()));
and a file.json containing objects like this:
[
  {
    "data": {
      "propert1": 24,
      "property2": "24__DDL_VXS",
      ...
    }
  },
  {
    "data": {
      "propert1": 28,
      "property2": "28__HDL_VDS",
      ...
    }
  },
  ...
]
Works like a charm in Postman.
Here is what I'm trying to run in cmd.
newman run \
-d file.json \
--global-var access_token=$TOK4EN \
--folder '/vlanspost' \
postman/postman_collection_v2.json
Based on the results I am getting, it looks like newman is not resolving the flag:
-d, --iteration-data <path> Specify a data file to use for iterations (either JSON or CSV)
and simply passes the literal string from the Body section, {{jsonBody}}, as the payload.
Has anyone had the same issue?
Thanks
I did it that way and it worked.
Put the collection and the data file into the same directory. For example:
C:\USERS\DUNGUYEN\DESKTOP\SO
---- file.json
\___ SO.postman_collection.json
From this folder, run the newman command:
newman run .\SO.postman_collection.json -d .\file.json --folder 'vlanspost'
This is the result:

Docker, Debezium not streaming data from mssql to elasticsearch

I followed this example to stream data from MySQL to Elasticsearch:
https://github.com/debezium/debezium-examples/tree/master/unwrap-smt#elasticsearch-sink
The example itself works great on my local machine.
But in my case I want to stream data from MSSQL (which is on another server, not in Docker) to Elasticsearch.
So in the "docker-compose-es.yaml" file I removed the "mysql" service and the mysql links.
And I created my own connector for MSSQL and sink for Elasticsearch:
{
  "name": "Test-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "192.168.1.234",
    "database.port": "1433",
    "database.user": "user",
    "database.password": "pass",
    "database.dbname": "Test",
    "database.server.name": "MyServer",
    "table.include.list": "dbo.TEST_A",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.testA"
  }
}
{
  "name": "elastic-sink-test",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "1",
    "topics": "TEST_A",
    "connection.url": "http://localhost:9200/",
    "transforms": "unwrap,key",
    "transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.key.field": "SQ",
    "key.ignore": "false",
    "type.name": "TEST_A",
    "behavior.on.null.values": "delete"
  }
}
When these are added, the Kafka Connect I/O works hard and has over 40 GB of input, see the image below:
In the Kafka logs it looks like it is going through all the tables. Here is the log for one of the tables:
2021-06-17 10:20:10,414 - INFO [data-plane-kafka-request-handler-5:Logging#66] - [Partition MyServer.dbo.TemplateGroup-0 broker=1] Log loaded for partition MyServer.dbo.TemplateGroup-0 with initial high watermark 0
2021-06-17 10:20:10,509 - INFO [data-plane-kafka-request-handler-3:Logging#66] - Creating topic MyServer.dbo.TemplateMeter with configuration {} and initial partition assignment Map(0 -> ArrayBuffer(1))
2021-06-17 10:20:10,516 - INFO [data-plane-kafka-request-handler-3:Logging#66] - [KafkaApi-1] Auto creation of topic MyServer.dbo.TemplateMeter with 1 partitions and replication factor 1 is successful
2021-06-17 10:20:10,526 - INFO [data-plane-kafka-request-handler-7:Logging#66] - [ReplicaFetcherManager on broker 1] Removed fetcher for partitions Set(MyServer.dbo.TemplateMeter-0)
2021-06-17 10:20:10,528 - INFO [data-plane-kafka-request-handler-7:Logging#66] - [Log partition=MyServer.dbo.TemplateMeter-0, dir=/kafka/data/1] Loading producer state till offset 0 with message format version 2
The database is only 2 GB. I'm not sure why the input is so high.
No test_a index was created in Elasticsearch when running this command:
curl http://localhost:9200/_aliases?pretty=true
Does anyone know how I can troubleshoot from here, or point me in the right direction?
Thanks in advance!
how I troubleshoot from here
docker compose logs?
Modify the log4j.properties of the Kafka Connect and/or Elasticsearch processes to get more logs?
Use a regular Kafka consumer to see if data is actually read into the TEST_A topic? (a rough sketch of that check is at the end of this answer)
in the "docker-compose-es.yaml" ....
If Debezium is running in a container, then Elasticsearch is not available at localhost:9200.
Change that value to http://elastic:9200, as shown in the es-sink.json.
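For the "regular Kafka consumer" check suggested above, a minimal sketch could look like this (it assumes the kafka-python package and that the broker is reachable at localhost:9092 from where it runs; Debezium names its topics <server.name>.<schema>.<table>, as the MyServer.dbo.* topics in your logs show, so both candidate topic names are checked):
from kafka import KafkaConsumer

# Check whether change events are actually being produced for the table.
consumer = KafkaConsumer(
    "TEST_A",                  # topic the Elasticsearch sink subscribes to
    "MyServer.dbo.TEST_A",     # topic Debezium would create for dbo.TEST_A
    bootstrap_servers="localhost:9092",  # use kafka:9092 from inside the compose network
    auto_offset_reset="earliest",        # read the topics from the beginning
    consumer_timeout_ms=10000,           # stop after 10 s of silence
)

for message in consumer:
    print(message.topic, message.offset, message.value)
If events only arrive under the MyServer.dbo.TEST_A name, the sink's topics setting would need to match that name.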

Export Data from Hadoop using sql-spark-connector (Apache)

I am trying to export data from Hadoop to MS SQL using the Apache Spark SQL connector, as instructed here: sql-spark-connector. It fails with the exception java.lang.NoSuchMethodError: com.microsoft.sqlserver.jdbc.SQLServerBulkCopy.writeToServer(Lcom/microsoft/sqlserver/jdbc/ISQLServerBulkRecord;)V
According to official documentation Supported Versions
My Development Environment:
Hadoop Version: 2.7.0
Spark Version: 2.4.5
Scala Version: 2.11.12
MS SQL Version: 2016
My Code:
package com.company.test

import org.apache.spark.sql.SparkSession

object TestETL {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .getOrCreate()
    import spark.implicits._

    // create DataFrame
    val export_df = Seq(1, 2, 3).toDF("id")
    export_df.show(5)

    // Connection String
    val server_name = "jdbc:sqlserver://ip_address:port"
    val database_name = "database"
    val url = server_name + ";" + "databaseName=" + database_name + ";"

    export_df.write
      .format("com.microsoft.sqlserver.jdbc.spark")
      .mode("append")
      .option("url", url)
      .option("dbtable", "export_test")
      .option("user", "username")
      .option("password", "password")
      .save()
  }
}
My SBT
build.sbt
Command line I executed:
/mapr/abc.company.com/user/dir/spark-2.4.5/bin/spark-submit --class com.company.test.TestETL /mapr/abc.company.com/user/dir/project/TestSparkSqlConnector.jar
JDBC Exception
I decompiled mssql-jdbc-8.2.0.jre8.jar to check whether it is missing the SQLServerBulkCopy.writeToServer method implementation, but that doesn't seem to be the case.
Any insights on how I can fix this?
It is a compatibility error. Please refer to this link, which explains the error, or just choose compatible versions: GitHub link

invoke-sqlcmd - can I change the ApplicationName?

I'm running queries against SQL servers using invoke-sqlcmd and invoke-sqlcmd2.
Is there a way to change the ApplicationName that it runs as? When I run a profiler trace, I see the queries are run by ".Net SqlClient Data Provider", and I'd like to change that.
Any help greatly appreciated
Okay, I futzed with the invoke-sqlcmd2 script by Chad Miller and came up with this:
Line 45, after "datarow" I added a comma, then:
[Parameter(Position=9, Mandatory=$false)] [string]$ApplicationName='Powershell'
Then modified the connection strings (about line 54):
if ($Username)
{ $ConnectionString = "Server={0};Database={1};User ID={2};Password={3};Application Name={5};Trusted_Connection=False;Connect Timeout={4}" -f $ServerInstance,$Database,$Username,$Password,$ConnectionTimeout,$ApplicationName }
else
{ $ConnectionString = "Server={0};Database={1};Integrated Security=True;Application Name={3};Connect Timeout={2}" -f $ServerInstance,$Database,$ConnectionTimeout,$ApplicationName }
The default Appname is now "Powershell", but can be changed by using the -ApplicationName parameter.
Include Application Name=MyAppName; in your connection string, and that will be what shows up in the profiler.
