Download empty string as empty string (not null) from Snowflake to Databricks - snowflake-cloud-data-platform

I would like to load data from a Snowflake table into a Databricks PySpark cluster. To do this, I ran:
a = spark.read \
.format("snowflake") \
.options(**cfg['sf_connection_options']) \
.option("dbtable", <db.schema.table>)\
.load()
Unfortunately, if a string is empty it is downloaded and treated as NULL.
Is there any way to properly treat NULLs as NULLs and empty strings as empty strings?

Try providing your schema manually; assume the columns below are the ones in your table/DF:
from pyspark.sql.types import StructType, StructField, IntegerType

custom_schema = StructType([
    StructField("Year", IntegerType(), True),
    StructField("Month", IntegerType(), True),
    StructField("DayofMonth", IntegerType(), True),
    StructField("DayOfWeek", IntegerType(), True),
    StructField("DepTime", IntegerType(), True),
    StructField("CRSDepTime", IntegerType(), True)])
So, in addition to this:
.option("dbtable", <db.schema.table>) \
add:
.schema(custom_schema)
It should look like this:
a = spark.read \
.format("snowflake") \
.schema(custom_schema) \
.options(**cfg['sf_connection_options']) \
.option("dbtable", <db.schema.table>) \
.load()
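If the schema alone does not preserve empty strings, one possible workaround (a sketch only, not something the connector documents; the column name, sentinel value, and table reference below are made up for illustration) is to tag empty strings on the Snowflake side with a query pushdown and restore them after the read:
from pyspark.sql import functions as F

# Hypothetical sketch: mark empty strings inside Snowflake so they survive the transfer,
# then turn the marker back into '' on the Spark side.
query = "SELECT IFF(my_col = '', '<EMPTY>', my_col) AS my_col FROM db.schema.table"

a = spark.read \
    .format("snowflake") \
    .options(**cfg['sf_connection_options']) \
    .option("query", query) \
    .load() \
    .withColumn("my_col", F.when(F.col("my_col") == "<EMPTY>", F.lit("")).otherwise(F.col("my_col")))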

Related

Upsert data into SQL Server from pyspark code

I have a PySpark dataframe that I want to upsert into a SQL Server table. I looked at the df.write modes and do not see any upsert option. Therefore I am trying to write the dataframe to HDFS in Parquet format and then sqoop the file using --update-mode allowinsert. However, I keep getting the following error:
Got exception in update thread: com.microsoft.sqlserver.jdbc.SQLServerException: One or more values is out of range of values for the datetime2 SQL Server data type
I tried writing the file as CSV just to check whether the contents/timestamps in the file are out of range; however, the timestamps are correct.
Has anybody been able to write the pyspark dataframe into a SQL Server table?
Here's the function to write DF to HDFS:
def write_df_to_hdfs(df, filename, hdfs_location_working):
    """
    Function to write delta records dataframe to HDFS
    """
    logging.info("Started writing delta records dataframe to hdfs")
    df.write.save(hdfs_location_working, format='parquet', mode='append', timestampFormat='YYYY-MM-dd hh:mm:ss.SSS', emptyValue="")
    logging.info("Successfully written delta records dataframe to hdfs")
Also, here's the sqoop command I'm using to write that data into SQL Server:
sqoop export -Dmapreduce.map.memory.mb=4096 -Dmapreduce.map.java.opts=-Xmx3000m -Dmapred.job.queuename=ici -Dsqoop.export.records.per.statement=30 -Dsqoop.export.statements.per.transaction=30 -libjars /opt/cloudera/parcels/CDH-7.1.6-1.cdh7.1.6.p6.12486751/lib/sqoop/lib/sqljdbc.jar --connect "jdbc:sqlserver://*******.hosts.cloud.ford.com;databaseName=SQTDIAPM_AM;schema=dbo;" \
--username 'user' \
--password 'pwd' \
--export-dir <HDFS Path> \
--table <tablename> \
--input-null-string '""' \
--input-null-string '\\N' \
--input-null-non-string '\\N' \
--update-key col1,col2 \
--update-mode allowinsert \
--batch \
-m 40 \
--verbose
Appreciate your help!
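For what it's worth, one alternative that skips Sqoop entirely (a sketch only; the URL, table names, and key columns below are placeholders rather than anything from the question) is to land the delta records in a SQL Server staging table over JDBC and then MERGE them into the target:
# Sketch: JDBC write into a staging table, then an upsert via MERGE on the SQL Server side.
jdbc_url = "jdbc:sqlserver://<host>;databaseName=<db>"
props = {"user": "<user>", "password": "<pwd>",
         "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

# 1) Overwrite the staging table with the delta records.
df.write.jdbc(url=jdbc_url, table="dbo.stage_table", mode="overwrite", properties=props)

# 2) Run a MERGE from staging into the target through any SQL Server client (e.g. pyodbc, sqlcmd).
merge_sql = """
MERGE dbo.target_table AS t
USING dbo.stage_table AS s
  ON t.col1 = s.col1 AND t.col2 = s.col2
WHEN MATCHED THEN UPDATE SET t.col3 = s.col3
WHEN NOT MATCHED THEN INSERT (col1, col2, col3) VALUES (s.col1, s.col2, s.col3);
"""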

Importing data from SQL Server to HIVE using SQOOP

I am able to successfully import data from SQL Server to HDFS using sqoop. However, when it tries to load the data into Hive I get an error, and I am not sure I understand the error correctly.
sudo -u hdfs sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect "jdbc:sqlserver://XX.XX.X.X:1433;instanceName=data-engr-sql-svr; databaseName=AdventureWorks2019" \
--username sa \
--password XXXXXXXX \
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--warehouse-dir "/user/hive/warehouse/AdventureWorks2019.db" \
--hive-import \
--create-hive-table \
--fields-terminated-by ',' \
--hive-table AdventureWorks2019.Production.TransactionHistory \
--table Production.TransactionHistory \
--split-by TransactionID \
-- --schema Production
I don't know how to handle schemas; most tutorials use a dummy database without proper schemas, which is not helpful.
Error
21/03/31 08:52:47 INFO conf.HiveConf: Using the default value passed in for log id: 95e2b831-cfe5-4108-be0f-0df1d9a8797e
21/03/31 08:52:47 INFO session.SessionState: Updating thread name to 95e2b831-cfe5-4108-be0f-0df1d9a8797e main
21/03/31 08:52:47 INFO conf.HiveConf: Using the default value passed in for log id: 95e2b831-cfe5-4108-be0f-0df1d9a8797e
21/03/31 08:52:47 INFO ql.Driver: Compiling command(queryId=hdfs_20210331085247_050638e8-593a-4d01-8020-c40b7db8e66a): CREATE TABLE IF NOT EXISTS AdventureWorks2019.Production.TransactionHistory ( TransactionID INT, ProductID INT, ReferenceOrderID INT, ReferenceOrderLineID INT, TransactionDate STRING, TransactionType STRING, Quantity INT, ActualCost DOUBLE, ModifiedDate STRING) COMMENT 'Imported by sqoop on 2021/03/31 08:52:45' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054' LINES TERMINATED BY '\012' STORED AS TEXTFILE
21/03/31 08:52:49 INFO hive.metastore: HMS client filtering is enabled.
21/03/31 08:52:49 INFO hive.metastore: Trying to connect to metastore with URI thrift://cnt7-naya-cdh63:9083
21/03/31 08:52:49 INFO hive.metastore: Opened a connection to metastore, current connections: 1
21/03/31 08:52:49 INFO hive.metastore: Connected to metastore.
21/03/31 08:52:49 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
FAILED: SemanticException [Error 10255]: Invalid table name AdventureWorks2019.Production.TransactionHistory
21/03/31 08:52:49 ERROR ql.Driver: FAILED: SemanticException [Error 10255]: Invalid table name AdventureWorks2019.Production.TransactionHistory
There is no such thing as schema inside the database in Hive. Database and schema mean the same thing and can be used interchangeably.
So, the bug is in using database.schema.table. Use a two-part database.table name in Hive, e.g. --hive-table AdventureWorks2019.TransactionHistory.
Read the documentation: Create/Drop/Alter/UseDatabase

Pyspark Dataframe unable to read real datatype data correctly from SQLServer

I am reading data into a Spark DataFrame using the code below, but I am getting an unexpected result.
Source SQLSERVER Table
Code used to read data from SQLServer
jdbcDF = spark.read \
.format("jdbc") \
.option("url", url) \
.option("dbtable", "schema.tablename") \
.option("user", "user") \
.option("password", "password") \
.load()
Output from above code
I am expecting below output.
tl;dr - This is expected due to loss of precision beyond the 4 bytes of the REAL type - REAL is an approximate numeric type - and Spark then takes that imprecise value and interprets it as DoubleType. If you want exact matches, you should look at exact-precision types in your underlying database, such as DECIMAL, which uses up to 17 storage bytes (maximum precision of 38) and gets picked up as DecimalType in Spark (see https://learn.microsoft.com/en-us/sql/connect/jdbc/understanding-data-type-differences?view=sql-server-ver15).
Longer read on how I tested this
I reconstructed your problem using MSSQL and Spark on Docker and can confirm that this appears to be the expected behaviour.
Reviewing the documentation on REAL type, this stands out in particular:
Real data can hold a value 4 bytes in size, meaning it has 7 digits of precision (the number of digits to the right of the decimal point). It's also a floating-point numeric that is identical to the floating point statement float(24).
This in essence means that your change from 5.987691E+07 to 5.9876908E7 is expected (the E notation is simply displayed differently by Spark and MSSQL). How the value beyond 7-digit precision is populated is not particularly deterministic - for example, while you got 5.987691E+07 interpreted as 5.9876908E7, my code gave me 5.9876912E7 (as you can see below) - this is due to loss of precision in the REAL type once all 4 bytes of data are exhausted.
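To see the same rounding outside SQL Server and Spark, you can round-trip the literal through a 4-byte IEEE 754 float, for example with Python's struct module (purely illustrative):
import struct

# Pack 5.987691E+07 into a 4-byte float (the same width as SQL Server's REAL) and read it back.
value = 5.987691e7
as_real = struct.unpack('f', struct.pack('f', value))[0]
print(as_real)  # 59876912.0 - anything beyond ~7 significant digits is a rounding artifact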
Now to retrace my steps (Docker steps are optional if you have MSSQL/Spark on host machine):
docker-compose up 2 containers, one with PySpark, one with MSSQL, expose all necessary ports (1433 for MSSQL, 8888 for Notebook on Spark side, etc.)
On host machine, run mssql -u sa -p <password> to get into your MSSQL instance.
Run the following commands:
mssql> create database test_db;
mssql> create table dbo.test_table (name varchar(100), value real);
mssql> insert into dbo.test_table values ('val1', 5.987691E+07);
mssql> insert into dbo.test_table values ('val2', 5.987691E+07);
mssql> insert into dbo.test_table values ('val3', 1.23456789E+07);
mssql> select * from dbo.test_table;
name value
---- --------
val1 59876912
val2 59876912
val3 12345679
On the Spark side, create the relevant SparkSession with all required jars (the MSSQL JDBC driver can be found here: https://www.microsoft.com/en-us/download/details.aspx?id=11774).
spark = (
    SparkSession
    .builder
    .appName("MSSQL Connection")
    .master("local[*]")
    .config("spark.jars", "/etc/sqljdbc42.jar")  # required to connect to MSSQL via JDBC
    .getOrCreate()
)
Read the table as DataFrame from JDBC source
df = spark.read \
.format("jdbc") \
.option("driver" , "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.option("url", "jdbc:sqlserver://mssql:1433") \
.option("dbtable", "dbo.test_table") \
.option("user", "sa") \
.option("password", "<password>") \
.load()
Investigate df.
df.show()
+----+-----------+
|name| value|
+----+-----------+
|val1|5.9876912E7|
|val2|5.9876912E7|
|val3|1.2345679E7|
+----+-----------+
df.printSchema()
root
|-- name: string (nullable = true)
|-- value: double (nullable = true)
In order to get exact values to match, I believe you need to change the type in your underlying MSSQL database to something with higher precision, for example DECIMAL.
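If changing the underlying column type is not immediately possible, a partial workaround (a sketch; the precision and scale are arbitrary) is to cast inside a pushdown subquery so that Spark at least receives a DecimalType. Note that this cannot recover precision already lost when the value was stored as REAL:
# Sketch: cast the REAL column to DECIMAL on the SQL Server side via a subquery pushdown.
df = spark.read \
    .format("jdbc") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("url", "jdbc:sqlserver://mssql:1433") \
    .option("dbtable", "(SELECT name, CAST(value AS DECIMAL(18, 1)) AS value FROM dbo.test_table) AS t") \
    .option("user", "sa") \
    .option("password", "<password>") \
    .load()

df.printSchema()  # value is now read as decimal(18,1) instead of double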

Can I change the datatype of the Spark dataframe columns that is being loaded to SQL Server as a table?

I am trying to read a Parquet file from Azure Data Lake using the following Pyspark code.
df = sqlContext.read.format("parquet") \
.option("header", "true") \
.option("inferSchema", "true") \
.load("adl://xyz/abc.parquet")
df = df['Id','IsDeleted']
Now I would like to load this dataframe df as a table in SQL Data Warehouse using the following code:
df.write \
.format("com.databricks.spark.sqldw") \
.mode('overwrite') \
.option("url", sqlDwUrlSmall) \
.option("forward_spark_azure_storage_credentials", "true") \
.option("dbtable", "test111") \
.option("tempdir", tempDir) \
.save()
This creates a table dbo.test111 in the SQL Datawarehouse with datatypes:
Id(nvarchar(256),null)
IsDeleted(bit,null)
But I need these columns to have different datatypes, say char(255) and varchar(128), in the SQL Data Warehouse. How do I do this while loading the dataframe into the SQL Data Warehouse?
I found a way that can help you modify the column data type, but it may not achieve exactly what you want:
df.select(col("colname").cast(DataType))
Here is a blog post about How to change column types in Spark SQL's DataFrame.
Maybe this can help you.
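As a sketch of that pattern applied to the two columns in the question (StringType is just an example; note that Spark SQL has no fixed-length VARCHAR(n), so the column width in SQL DW is still decided by the connector):
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

# Cast both columns before writing with com.databricks.spark.sqldw.
df = df.select(col("Id").cast(StringType()), col("IsDeleted").cast(StringType()))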
The only supported data types in Spark SQL are listed at https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/types/package-summary.html
String types will in fact be turned into VARCHAR with unspecified length.
Spark SQL does not have a VARCHAR(n) data type.
You should be able to do something like below
import org.apache.spark.sql.types._

val dfModified = df.withColumn("Id_mod", df("Id").cast(StringType))
  .withColumn("IsDeleted_mod", df("IsDeleted").cast(StringType))
  .drop("Id")
  .drop("IsDeleted")
  .withColumnRenamed("Id_mod", "Id")
  .withColumnRenamed("IsDeleted_mod", "IsDeleted")
// Replace StringType with any supported desired type
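Since the question uses PySpark, a rough PySpark equivalent of the same idea (again, replace StringType with whichever supported type you need) would be:
from pyspark.sql.types import StringType

df = (df.withColumn("Id_mod", df["Id"].cast(StringType()))
        .withColumn("IsDeleted_mod", df["IsDeleted"].cast(StringType()))
        .drop("Id")
        .drop("IsDeleted")
        .withColumnRenamed("Id_mod", "Id")
        .withColumnRenamed("IsDeleted_mod", "IsDeleted"))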

Does Sqoop use Reducer?

Does Sqoop run a reducer if there is a join/aggregation in the SELECT query given with the --query parameter? Or is there any case in Sqoop where both mappers and reducers run?
Documentation specifies that each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop.
$ sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/foo/joinresults
In the example above, how does the JOIN take place where the table is first partitioned using $CONDITIONS?
The join/computation will be executed on the RDBMS, and its result will be used by the mappers to transfer the data to HDFS.
No reducer is involved.
With the --query parameter, you need to specify the --split-by parameter with the column that should be used for slicing your data into multiple parallel map tasks. (For plain table imports, this parameter usually defaults to the primary key of the main table.)
Sqoop automatically substitutes the $CONDITIONS placeholder with generated conditions that specify which slice of data each mapper should transfer, for example a.id >= 0 AND a.id < 10000 for one mapper and the next id range for the next.
In your particular command, Sqoop does not use a reducer.
However, there are cases when Sqoop does use reducers. Check the example below, taken from the documentation here.
$ sqoop export \
-Dmapred.reduce.tasks=2 \
-Dpgbulkload.bin="/usr/local/bin/pg_bulkload" \
-Dpgbulkload.input.field.delim=$'\t' \
-Dpgbulkload.check.constraints="YES" \
-Dpgbulkload.parse.errors="INFINITE" \
-Dpgbulkload.duplicate.errors="INFINITE" \
--connect jdbc:postgresql://pgsql.example.net:5432/sqooptest \
--connection-manager org.apache.sqoop.manager.PGBulkloadManager \
--table test --username sqooptest --export-dir=/test -m 2
