Cannot write non-nullable data to Azure Synapse - PySpark - sql-server

The Problem:
I am unable to write any dataframe that contains a non-nullable column to Azure Synapse's dedicated SQL pool.
Problem details:
I have a DataFrame with the following schema:
StructType(List(
    StructField(key,StringType,false),
    StructField(Col_1,StringType,true),
    StructField(Col_2,IntegerType,true)))
When I try to write this to my Azure Synapse dedicated SQL pool with:
DataFrame.write.mode("overwrite").synapsesql("DB.SCHEMA.TEST_TABLE")
I receive the following error:
Py4JJavaError Traceback (most recent call last)
<ipython-input-65-362788a> in <module>
2 #Failed to validate staging directory path. Verify if there's external user action to cancel the staging spark job.
Band-aid Solution:
Using a code snippet from this Stack Overflow question:
def set_df_columns_nullable(spark, df, column_list, nullable=True):
    # Flip the nullability flag on the listed columns in the existing schema ...
    for struct_field in df.schema:
        if struct_field.name in column_list:
            struct_field.nullable = nullable
    # ... then rebuild the DataFrame from its RDD using the modified schema.
    df_mod = spark.createDataFrame(df.rdd, df.schema)
    return df_mod
I am able to modify the "key" column to "nullable" after which the above write function will happily write the data to the table.
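For reference, a minimal sketch of how the workaround is applied before writing. It assumes a Synapse Spark pool where the synapsesql method is available, and reuses the DataFrame (df) and table name from the question:

# Relax nullability on the "key" column, then write as before.
df_writable = set_df_columns_nullable(spark, df, ["key"], nullable=True)
df_writable.printSchema()  # "key" should now show nullable = true

df_writable.write.mode("overwrite").synapsesql("DB.SCHEMA.TEST_TABLE")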
The Question:
What is the explanation here? Why won't the Synapse table accept non-nullable data from Spark?

Related

<MSDialect_pyodbc, TIMESTAMP> error when comparing tables in DVT - BigQuery and SQL Server

I'm trying to compare table data between two databases (on-prem SQL Server and BigQuery). I'm currently using the Data Validation Tool (DVT) for that.
Using the instructions from GitHub (link: https://github.com/GoogleCloudPlatform/professional-services-data-validator/tree/develop/docs), I installed DVT and created a connection.
When I compare two tables from a single source, it works fine and gives correct output. But when I check across the two different sources, it returns a dtype: <MSDialect_pyodbc, TIMESTAMP> error.
Details:
I tried to validate a column-level count, but I still get the same error. Command -
data-validation validate column -sc wc_sql_conn -tc my_bq_conn -tbls EDW.dbo.TableName=project_id.dataset_name.TableName --primary-keys PK_Col_Name --count '*'
Also, when I checked schema-level validation -
data-validation validate schema -sc wc_sql_conn -tc my_bq_conn -tbls EDW.dbo.TableName=project_id.dataset_name.TableName
I also tried adding only INT/STRING columns to a custom query and comparing those.
Custom Query -
SELECT PK_Col_Name, Category FROM EDW.dbo.TableName
A similar custom query was prepared for BQ; the commands for the custom-query comparison -
data-validation validate custom-query -sc wc_sql_conn -tc my_bq_conn -cqt 'column' -sqf sql_query.txt -tqf bq_query.txt -pk PK_Col_Name --count '*'
data-validation validate custom-query -sc wc_sql_conn -tc my_bq_conn -cqt 'row' -sqf sql_query.txt -tqf bq_query.txt -pk PK_Col_Name --count '*'
Even across these different approaches, and even though the scenario doesn't involve a datetime/timestamp column, I consistently get the same error -
NotImplementedError: Could not find signature for dtype: <MSDialect_pyodbc, TIMESTAMP>
I tried googling the error, but no luck. Could someone please help me identify the cause?
Additionally, there are no data-validation-tool or google-pso-data-validator tags available. If someone could add them, they could be used in the future and reach the right people.

Asynchronous cursor execution in Snowflake

(Submitting on behalf of a Snowflake user)
I need the query id at the time of query execution on Snowflake, so I am using the following code snippet:
cursor.execute(query, _no_results=True)
query_id = cursor.sfqid
cursor.query_result(query_id)
This code snippet works fine for short-running queries, but for a query that takes more than 40-45 seconds to execute, the query_result function fails with KeyError: u'rowtype'.
Stack trace:
File "snowflake/connector/cursor.py", line 631, in query_result
self._init_result_and_meta(data, _use_ijson)
File "snowflake/connector/cursor.py", line 591, in _init_result_and_meta
for column in data[u'rowtype']:
KeyError: u'rowtype'
Why would this error occur, and how can I solve this problem?
Any recommendations? Thanks!
The Snowflake Python Connector allows for async SQL execution by using cur.execute(sql, _no_results=True).
This "fire and forget" style of SQL execution allows the parent process to continue without waiting for the SQL command to complete (think long-running SQL that may time out).
If this is used, many developers will write code that captures the unique Snowflake Query ID (like you have in your code) and then use that Query ID to "check back on the query status later" in some sort of looping process. Once the query has finished, you can get its results from that Query ID using the result_scan( ) function.
https://docs.snowflake.net/manuals/sql-reference/functions/result_scan.html
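To make that concrete, here is a minimal sketch of the polling pattern with the Python connector. The connection parameters and the example query are placeholders, and get_query_status / is_still_running are only available in more recent connector versions:

import time
import snowflake.connector

# Placeholder connection parameters -- replace with your own.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="my_schema")
cur = conn.cursor()

# Placeholder for your long-running SQL.
query = "select count(*) from my_large_table"

# Fire-and-forget execution: returns immediately and exposes the query id.
cur.execute(query, _no_results=True)
query_id = cur.sfqid

# Check back on the query status in a loop until it finishes.
while conn.is_still_running(conn.get_query_status(query_id)):
    time.sleep(5)

# Pull the finished query's output via RESULT_SCAN.
cur.execute("select * from table(result_scan('{}'))".format(query_id))
rows = cur.fetchall()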
I hope this helps...Rich

dplyr::tbl returning list instead of table

I am trying to use dplyr in RStudio to manipulate tables in an MS SQL Server database. I successfully connected to the database using DBI and odbc.
Code:
library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)
con <- dbConnect(odbc(),
                 Driver = "SQL Server",
                 Server = "myserver",
                 database = "ABC",
                 UID = "sqladmin",
                 PWD = "pwd",
                 port = '14333')
data <- tbl(con, "abc")
abc is a table within the database ABC. The connection is successful (I am able to look at the tables and fields), but dplyr::tbl is returning a list of 2 instead of the table abc. So data is a list instead of a table. Where am I going wrong in this code?
The schema is ABC --> dbo --> abc
The code works as expected. What you're seeing is simply a limitation of the type display in the RStudio data inspector: the value returned by tbl is an object of S3 class tbl_dbi (a remote table), but it is implemented as a nested list (similar to how data.frames are implemented as lists of columns).
You will be able to work with your data as expected. You can also invoke as_tibble(data) to get a proper tibble back … but you don’t need to do that to work with it!
Building on @Konrad's answer, some additional considerations:
There is a distinction between a local dataframe and a remote dataframe.
remote_data <- tbl(connection, "database_table_name") creates a remote data frame. The data is stored in the source database, but R has a pointer to the database that can be used for querying it.
You can load data from a remote dataframe into R memory using local_data <- collect(remote_data) or local_data <- as.data.frame(remote_data). Depending on the size of your remote data this can be very slow, or it can crash R due to lack of memory.
Both the local and the remote dataframe behave as dataframes: class(remote_data) and class(local_data) both include the tbl class. The remote dataframe is implemented as a list because it needs to store different info from the local dataframe. Try head(remote_data, 100) to view the first 100 rows of the remote table.
Remote dataframes can be manipulated using (most) standard dplyr commands. These are translated by dbplyr into the corresponding database syntax and executed on the database.
A good use of remote tables is to perform initial filtering and summarizing of a large table before pulling the summarized data into local R memory for further processing. For example:
library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)
con <- dbConnect(odbc(),
                 Driver = "SQL Server",
                 Server = "server_name",
                 database = "database_name",
                 UID = "sqladmin",
                 PWD = "pwd",
                 port = '14333')
remote_data <- tbl(con, "database_table_name")
# preview remote table
head(remote_data)
# summarize
prepared_data <- remote_data %>%
  filter(column_1 >= 10) %>%
  group_by(column_2) %>%
  summarize(total = sum(column_2), .groups = 'drop')
# check query of prepared table
show_query(prepared_data)
# draw summarised table into local memory
local_summarised_data <- collect(prepared_data)
Edit: Some additional points following @mykonos' question:
Storage of remote tables works differently from storage of local tables.
In R the command prepared_data <- local_table %>% mutate(new = 2 * old) creates a separate copy of the source data (it is a little more complex than this because of lazy evaluation under the hood, but this is a sufficient way to think of it). If you were removing objects from your workspace with rm, you would have to remove both copies.
However, remote tables are not duplicate copies of the data on the database. The command prepared_data <- remote_table %>% mutate(new = 2 * old) creates a second remote table in R. This means we have two remote table objects in R both pointing back to the same database table (but in different ways).
Remote table definitions in R are defined by two components: the database connection and the query that produces the current table. When we manipulate a remote table in R (by default), all we are doing is changing the query. You can use show_query to check the query that is currently defined.
So, when we create remote_data <- tbl(con, "database_table_name") then remote_data is stored in R as the database connection and a query something like: SELECT * FROM database_table_name.
When we create prepared_data <- remote_table %>% mutate(new = 2 * old) then prepared_data is stored in R as the same database connection as remote_table but a different query. Something like: SELECT *, new = 2*old FROM database_table_name.
Changing remote tables in R does not affect the database. Because manipulating a remote table only changes its query, working with database tables via dbplyr cannot change the source data. This is consistent with the design and intention of databases.
If you want to write to a database from R, there are a range of ways to do it. There are a number of questions tagged dbplyr on SO that ask about this.
One downside of this approach is that lengthy manipulations of remote tables can perform poorly because the database has to do significant work to show the final result. I recommend you explore options to write temporary / intermediate tables if this is a concern.

How to indicate the database in SparkSQL over Hive in Spark 1.3

I have simple Scala code that retrieves data from a Hive database and creates an RDD out of the result set. It works fine with HiveContext. The code is similar to this:
val hc = new HiveContext(sc)
val mySql = "select PRODUCT_CODE, DATA_UNIT from account"
hc.sql("use myDatabase")
val rdd = hc.sql(mySql).rdd
The version of Spark that I'm using is 1.3. The problem is that the default setting for hive.execution.engine is 'mr', which makes Hive use MapReduce, which is slow. Unfortunately, I can't force it to use "spark".
I tried to use SQLContext instead, replacing the above with hc = new SQLContext(sc), to see if performance would improve. With this change the line
hc.sql("use myDatabase")
is throwing the following exception:
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``insert'' expected but identifier use found
use myDatabase
^
The Spark 1.3 documentation says that SparkSQL can work with Hive tables. My question is how to indicate that I want to use a certain database instead of the default one.
use database
is supported in later Spark versions
https://docs.databricks.com/spark/latest/spark-sql/language-manual/use-database.html
You need to put the statement in two separate spark.sql calls like this:
spark.sql("use mydb")
spark.sql("select * from mytab_in_mydb").show
Go back to creating the HiveContext. The HiveContext gives you the ability to create a dataframe using Hive's metastore. Spark only uses the metastore from Hive; it doesn't use Hive as a processing engine to retrieve the data. So when you create the df using your sql query, it's really just asking Hive's metastore "Where is the data, and what's the format of the data?"
Spark takes that information and runs the process against the underlying data on HDFS. So Spark is executing the query, not Hive.
When you create the SQLContext, you remove the link between Spark and the Hive metastore, so the error is saying it doesn't understand what you want to do.
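For illustration, here is a minimal PySpark sketch of that HiveContext approach (Spark 1.x API; the equivalent calls work the same way in Scala, and the database, table, and column names are the ones from the question):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-metastore-example")
hc = HiveContext(sc)  # backed by the Hive metastore

# Option 1: switch the current database, then query.
hc.sql("use myDatabase")
rdd = hc.sql("select PRODUCT_CODE, DATA_UNIT from account").rdd

# Option 2: qualify the table name with the database directly.
df = hc.sql("select PRODUCT_CODE, DATA_UNIT from myDatabase.account")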
I have not been able to implement the use database command, but here is a workaround to use the desired database:
spark-shell --queue QUEUENAME;
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val res2 = sqlContext.sql("select count(1) from DB_NAME.TABLE_NAME")
res2.collect()

SOQL - Convert Date To Owner Locale

We use DBAmp for integrating Salesforce.com with SQL Server (it basically adds a linked server), and we run queries against our SF data using OPENQUERY.
I'm trying to do some reporting against opportunities and want to return the created date of each opportunity in the opportunity owner's local date time (i.e. the date time the user will see in Salesforce).
Our DBAmp configuration forces the dates to be UTC.
I stumbled across a date function (in the Salesforce documentation) that I thought might be of some help, but I get an error when I try to use it, so I can't prove it. Below is the example usage of the convertTimezone function:
SELECT HOUR_IN_DAY(convertTimezone(CreatedDate)), SUM(Amount)
FROM Opportunity
GROUP BY HOUR_IN_DAY(convertTimezone(CreatedDate))
Below is the error returned:
OLE DB provider "DBAmp.DBAmp" for linked server "SALESFORCE" returned message "Error 13005 : Error translating SQL statement: line 1:37: expecting "from", found '('".
Msg 7350, Level 16, State 2, Line 1
Cannot get the column information from OLE DB provider "DBAmp.DBAmp" for linked server "SALESFORCE".
Can you not use SOQL functions in OPENQUERY as below?
SELECT *
FROM OPENQUERY(SALESFORCE, '
    SELECT HOUR_IN_DAY(convertTimezone(CreatedDate)), SUM(Amount)
    FROM Opportunity
    GROUP BY HOUR_IN_DAY(convertTimezone(CreatedDate))')
UPDATE:
I've just had some correspondence with Bill Emerson (I believe he is the creator of the DBAmp Integration Tool):
You should be able to use SOQL functions so I am not sure why you are
getting the parsing failure. I'll setup a test case and report back.
I'll update the post again when I hear back. Thanks
A new version of DBAmp (2.14.4) has just been released that fixes the issue with using ConvertTimezone in openquery.
Version 2.14.4
Code modified for better memory utilization
Added support for API 24.0 (SPRING 12)
Fixed issue with embedded question marks in string literals
Fixed issue with using ConvertTimezone in openquery
Fixed issue with "Invalid Numeric" when using aggregate functions in openquery
I'm fairly sure that because DBAmp uses SQL and not SOQL, SOQL functions would not be available, sorry.
You would need to expose this data some other way. Perhaps it's possible with a Salesforce report, a web service, or by compiling the data in the program you are using to access the (DBAmp) SQL Server.
If you were to create a Salesforce web service, the following example might be helpful.
global class MyWebService
{
    webservice static AggregateResult MyWebServiceMethod()
    {
        AggregateResult ar = [
            SELECT
                HOUR_IN_DAY(convertTimezone(CreatedDate)) Hour,
                SUM(Amount) Amount
            FROM Opportunity
            GROUP BY HOUR_IN_DAY(convertTimezone(CreatedDate))];
        system.debug(ar);
        return ar;
    }
}
