Pyspark on EMR with JDBC task hangs forever - sql-server

I'm trying to execute a simple PySpark job on EMR (emr-6.2.0, Spark 3.0.1) as follows:
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName("Spark Downloader")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.sql.streaming.schemaInference", "true")
    .config("maximizeResourceAllocation", "true")
    .getOrCreate())
spark.sparkContext.setLogLevel("WARN")
jdbcDf = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://sqlserver:1433;databaseName=dw")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", "db.dbo.My200GBTable")
    .option("user", "xxx")
    .option("password", "xxx")
    .option("partitionColumn", "SOMEDATEFIELD")
    .option("lowerBound", "2019-01-01")
    .option("upperBound", "2022-05-26")
    .option("numPartitions", 250)
    .load())
jdbcDf.write.mode("overwrite").partitionBy("SOMEDATEFIELD").parquet("hdfs:///output/tables/My200GBTable/")
spark.stop()
Submitting with:
"Name": "1. Spark Download My200GBTable to HDFS",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"spark-submit",
"--deploy-mode",
"cluster",
"--master",
"yarn",
"--conf",
"spark.jars.packages=com.microsoft.azure:spark-mssql-connector_2.12:1.2.0",
"--conf",
"spark.yarn.submit.waitAppCompletion=true",
"s3a://pyspark_apps/download_table_to_hdfs.py"
In another step I will use s3-dist-cp to copy the output to an S3 bucket before the cluster stops.
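For reference, a rough sketch of what that copy step could look like, in the same command-runner.jar step format as above (the step name, bucket and destination path are placeholders, not my real values):
{
  "Name": "2. Copy My200GBTable from HDFS to S3",
  "ActionOnFailure": "CONTINUE",
  "HadoopJarStep": {
    "Jar": "command-runner.jar",
    "Args": [
      "s3-dist-cp",
      "--src=hdfs:///output/tables/My200GBTable/",
      "--dest=s3://my-placeholder-bucket/tables/My200GBTable/"
    ]
  }
}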
The Spark job itself works properly: it connects and generates all the output, down to the _SUCCESS file. However, there is always one remaining task that prevents the job from finishing. The YARN timeline looks like this:
The partitions are not skewed and the output files are evenly sized. Am I missing something obvious?

Related

Zeppelin Python Flink cannot print to console

I'm using Kinesis Data Analytics Studio which provides a Zeppelin environment.
Very simple code:
%flink.pyflink
from pyflink.common.serialization import JsonRowDeserializationSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer

# create env: use Zeppelin's s_env when running remotely in the notebook, otherwise create a local environment
env = s_env or StreamExecutionEnvironment.get_execution_environment()
env.add_jars("file:///home/ec2-user/flink-sql-connector-kafka_2.12-1.13.5.jar")

# create a kafka consumer
deserialization_schema = JsonRowDeserializationSchema.builder() \
    .type_info(type_info=Types.ROW_NAMED(
        ['id', 'name'],
        [Types.INT(), Types.STRING()])
    ).build()

kafka_consumer = FlinkKafkaConsumer(
    topics='nihao',
    deserialization_schema=deserialization_schema,
    properties={
        'bootstrap.servers': 'kakfa-brokers:9092',
        'group.id': 'group1'
    })
kafka_consumer.set_start_from_earliest()

ds = env.add_source(kafka_consumer)
ds.print()
env.execute('job1')
I can get this working locally and can see the change logs being produced to the console. However, I cannot get the same results in Zeppelin.
I also checked STDOUT on the task managers in the Flink web console; there is nothing there either.
Am I missing something? I've searched for days and could not find anything on this.
I'm not 100% sure, but I think you may need a sink to begin pulling data through the datastream; you could potentially use the included Print Sink Function.
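Here is a minimal sketch of that idea, continuing the snippet from the question (env, ds and the broker address come from there; the output topic name nihao-out is just a placeholder): attach an explicit sink that writes the stream back out to Kafka, so the records can be checked outside the notebook.
%flink.pyflink
from pyflink.common.serialization import JsonRowSerializationSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream.connectors import FlinkKafkaProducer

# serialize the same (id, name) rows back to JSON
serialization_schema = JsonRowSerializationSchema.builder() \
    .with_type_info(type_info=Types.ROW_NAMED(
        ['id', 'name'],
        [Types.INT(), Types.STRING()])
    ).build()

# explicit sink: write the stream to an output topic ("nihao-out" is a placeholder)
kafka_producer = FlinkKafkaProducer(
    topic='nihao-out',
    serialization_schema=serialization_schema,
    producer_config={'bootstrap.servers': 'kakfa-brokers:9092'})

# ds and env are the datastream and environment from the question's snippet
ds.add_sink(kafka_producer)
env.execute('job1')
If records show up on that topic, data is flowing through the pipeline and the remaining question is only where the ds.print() output ends up.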

How to use iteration data with newman

I have a collection in Postman which loads "payload objects" from a JSON file, and I want to run it in newman from the command line.
POST request
Body: the body of the POST request is {{jsonBody}}
Pre-request Script: pm.globals.set("jsonBody", JSON.stringify(pm.iterationData.toObject()));
and a file.json containing objects like this:
[
  {
    "data": {
      "propert1": 24,
      "property2": "24__DDL_VXS",
      ...
    }
  },
  {
    "data": {
      "propert1": 28,
      "property2": "28__HDL_VDS",
      ...
    }
  },
  ...
]
Works like a charm in Postman.
Here is what I'm trying to run in cmd.
newman run \
-d file.json \
--global-var access_token=$TOK4EN \
--folder '/vlanspost' \
postman/postman_collection_v2.json
Based on the results I am getting, it looks like newman is not resolving the flag:
-d, --iteration-data <path> Specify a data file to use for iterations (either JSON or CSV)
and simply passes the literal string from the Body section, {{jsonBody}}, as the payload.
Has anyone had the same issue?
Thanks
I did it that way and it worked.
Put the collection and the data file into the same directory. For example:
C:\USERS\DUNGUYEN\DESKTOP\SO
---- file.json
\___ SO.postman_collection.json
From this folder, run the newman command:
newman run .\SO.postman_collection.json -d .\file.json --folder 'vlanspost'
This is the result:

Docker, Debezium not streaming data from mssql to elasticsearch

I followed this example to stream data from MySQL to Elasticsearch:
https://github.com/debezium/debezium-examples/tree/master/unwrap-smt#elasticsearch-sink
The example itself works great on my local machine.
But in my case I want to stream data from MSSQL (which is on another server, not in Docker) to Elasticsearch.
So in the "docker-compose-es.yaml" file I removed the "mysql" service and the mysql links.
And I created my own connector for MSSQL and sink for Elasticsearch:
{
  "name": "Test-connector",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "database.hostname": "192.168.1.234",
    "database.port": "1433",
    "database.user": "user",
    "database.password": "pass",
    "database.dbname": "Test",
    "database.server.name": "MyServer",
    "table.include.list": "dbo.TEST_A",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "dbhistory.testA"
  }
}
{
  "name": "elastic-sink-test",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "1",
    "topics": "TEST_A",
    "connection.url": "http://localhost:9200/",
    "transforms": "unwrap,key",
    "transforms.unwrap.type": "io.debezium.transforms.UnwrapFromEnvelope",
    "transforms.unwrap.drop.tombstones": "false",
    "transforms.key.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
    "transforms.key.field": "SQ",
    "key.ignore": "false",
    "type.name": "TEST_A",
    "behavior.on.null.values": "delete"
  }
}
When these are added, the Kafka Connect I/O works hard and has over 40 GB of input, see the image below:
In the Kafka logs it looks like it is going through all the tables. Here is the log for one of the tables:
2021-06-17 10:20:10,414 - INFO [data-plane-kafka-request-handler-5:Logging#66] - [Partition MyServer.dbo.TemplateGroup-0 broker=1] Log loaded for partition MyServer.dbo.TemplateGroup-0 with initial high watermark 0
2021-06-17 10:20:10,509 - INFO [data-plane-kafka-request-handler-3:Logging#66] - Creating topic MyServer.dbo.TemplateMeter with configuration {} and initial partition assignment Map(0 -> ArrayBuffer(1))
2021-06-17 10:20:10,516 - INFO [data-plane-kafka-request-handler-3:Logging#66] - [KafkaApi-1] Auto creation of topic MyServer.dbo.TemplateMeter with 1 partitions and replication factor 1 is successful
2021-06-17 10:20:10,526 - INFO [data-plane-kafka-request-handler-7:Logging#66] - [ReplicaFetcherManager on broker 1] Removed fetcher for partitions Set(MyServer.dbo.TemplateMeter-0)
2021-06-17 10:20:10,528 - INFO [data-plane-kafka-request-handler-7:Logging#66] - [Log partition=MyServer.dbo.TemplateMeter-0, dir=/kafka/data/1] Loading producer state till offset 0 with message format version 2
The database is only 2 GB. I'm not sure why the input is so high.
No test_a index was created in Elasticsearch when running this command:
curl http://localhost:9200/_aliases?pretty=true
Does anyone know how I can troubleshoot from here, or point me in the right direction?
Thanks in advance!
how I troubleshoot from here
docker compose logs?
Modify the log4j.properties of the Kafka Connect and/or Elasticsearch processes to get more logs?
Use a regular Kafka consumer to see if data is actually read into the TEST_A topic? (a rough sketch of that check is at the end of this answer)
in the "docker-compose-es.yaml" ....
If Debezium is running in a container, then Elasticsearch is not available at localhost:9200.
Change that value to http://elastic:9200, as shown in the es-sink.json.
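For the "regular Kafka consumer" check suggested above, a minimal sketch could look like this (it assumes the kafka-python package and that the broker is reachable at localhost:9092 from where it runs; Debezium names its topics <server.name>.<schema>.<table>, as the MyServer.dbo.* topics in your logs show, so both candidate topic names are checked):
from kafka import KafkaConsumer

# Check whether change events are actually being produced for the table.
consumer = KafkaConsumer(
    "TEST_A",                  # topic the Elasticsearch sink subscribes to
    "MyServer.dbo.TEST_A",     # topic Debezium would create for dbo.TEST_A
    bootstrap_servers="localhost:9092",  # use kafka:9092 from inside the compose network
    auto_offset_reset="earliest",        # read the topics from the beginning
    consumer_timeout_ms=10000,           # stop after 10 s of silence
)

for message in consumer:
    print(message.topic, message.offset, message.value)
If events only arrive under the MyServer.dbo.TEST_A name, the sink's topics setting would need to match that name.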

Export Data from Hadoop using sql-spark-connector (Apache)

I am trying to export data from Hadoop to MS SQL using the Apache Spark SQL connector, as instructed here: sql-spark-connector. It fails with the exception java.lang.NoSuchMethodError: com.microsoft.sqlserver.jdbc.SQLServerBulkCopy.writeToServer(Lcom/microsoft/sqlserver/jdbc/ISQLServerBulkRecord;)V
According to official documentation Supported Versions
My Development Environment:
Hadoop Version: 2.7.0
Spark Version: 2.4.5
Scala Version: 2.11.12
MS SQL Version: 2016
My Code:
package com.company.test

import org.apache.spark.sql.SparkSession

object TestETL {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .getOrCreate()
    import spark.implicits._

    // create DataFrame
    val export_df = Seq(1, 2, 3).toDF("id")
    export_df.show(5)

    // Connection String
    val server_name = "jdbc:sqlserver://ip_address:port"
    val database_name = "database"
    val url = server_name + ";" + "databaseName=" + database_name + ";"

    export_df.write
      .format("com.microsoft.sqlserver.jdbc.spark")
      .mode("append")
      .option("url", url)
      .option("dbtable", "export_test")
      .option("user", "username")
      .option("password", "password")
      .save()
  }
}
My SBT
build.sbt
Command line I executed:
/mapr/abc.company.com/user/dir/spark-2.4.5/bin/spark-submit --class com.company.test.TestETL /mapr/abc.company.com/user/dir/project/TestSparkSqlConnector.jar
JDBC Exception
I decompiled mssql-jdbc-8.2.0.jre8.jar to check whether it is missing the SQLServerBulkCopy.writeToServer method implementation, but that doesn't seem to be the case.
Any insights on how I can fix this?
It is a compatibility error. Please refer to this link, which explains the error, or just choose compatible versions: GitHub link

invoke-sqlcmd - can I change the ApplicationName?

I'm running queries against SQL servers using invoke-sqlcmd and invoke-sqlcmd2.
Is there a way to change the ApplicationName that it runs as? When I run a profiler trace, I see the queries are run by ".Net SqlClient Data Provider", and I'd like to change that.
Any help greatly appreciated
Okay, I futzed with the invoke-sqlcmd2 script by Chad Miller and came up with this:
Line 45, after "datarow" I added a comma, then:
[Parameter(Position=9, Mandatory=$false)] [string]$ApplicationName='Powershell'
Then modified the connection strings (about line 54):
if ($Username)
{ $ConnectionString = "Server={0};Database={1};User ID={2};Password={3};Application Name={5};Trusted_Connection=False;Connect Timeout={4}" -f $ServerInstance,$Database,$Username,$Password,$ConnectionTimeout,$ApplicationName }
else
{ $ConnectionString = "Server={0};Database={1};Integrated Security=True;Application Name={3};Connect Timeout={2}" -f $ServerInstance,$Database,$ConnectionTimeout,$ApplicationName }
The default Appname is now "Powershell", but can be changed by using the -ApplicationName parameter.
Include Application Name=MyAppName; in your connection string, and that will be what shows up in the profiler.
