Azure Databricks Spark DataFrame fails to insert into MS SQL Server using the MS Spark JDBC connector when the executor tries 4,096 or fewer records - sql-server

That's a title and a half, but it pretty much summarises my "problem".
I have an Azure Databricks workspace, and an Azure Virtual Machine running SQL Server 2019 Developer. They're on the same VNET, and they can communicate nicely with each other. I can select rows very happily from the SQL Server, and some instances of inserts work really nicely too.
My scenario:
I have a spark table foo, containing any number of rows. Could be 1, could be 20m.
foo contains 19 fields.
The contents of foo need to be inserted into a table on the SQL Server also called foo, in a database called bar, meaning my destination is bar.dbo.foo.
I've got the com.microsoft.sqlserver.jdbc.spark connector configured on the cluster, and I connect using an IP, port, username and password.
My notebook cell of relevance:
df = spark.table("foo")
try:
    url = "jdbc:sqlserver://ip:port"
    table_name = "bar.dbo.foo"
    username = "user"
    password = "password"

    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("append") \
        .option("truncate", True) \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .option("queryTimeout", 120) \
        .option("tableLock", True) \
        .option("numPartitions", 1) \
        .save()
except ValueError as error:
    print("Connector write failed", error)
If I prepare foo to contain 10,000 rows, I can run this script time and time again, and it succeeds every time.
As the row count drops, the executor occasionally tries to process exactly 4,096 rows in a task. As soon as it tries to do 4,096 in a task, weird things happen.
For example, having created foo to contain 5,000 rows and executing the code, this is the task information:
Index | Task Id | Attempt | Status  | Executor ID | Host        | Duration | Input Size / Records | Errors
0     | 660     | 0       | FAILED  | 0           | 10.139.64.6 | 40s      | 261.3 KiB / 4096     | com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
0     | 661     | 1       | FAILED  | 3           | 10.139.64.8 | 40s      | 261.3 KiB / 4096     | com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
0     | 662     | 2       | FAILED  | 3           | 10.139.64.8 | 40s      | 261.3 KiB / 4096     | com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
0     | 663     | 3       | SUCCESS | 1           | 10.139.64.5 | 0.4s     | 261.3 KiB / 5000     |
I don't fully understand why it fails after 40 seconds. Our timeouts are set to 600 seconds on the SQL box, and the query timeout in the script is 120 seconds.
Every time the executor processes more than 4,096 rows in a task, it succeeds. This is true regardless of the size of the dataset. Sometimes, on a 100k-row set, it tries 4,096 rows in a task, fails, and the retry then processes the full 100k records and immediately succeeds.
When the set is smaller than 4,096, the execution will typically generate one message:
com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed
and then immediately works, having moved on to the next executor.
On the SQL Server itself, I see ASYNC_NETWORK_IO as the wait using Adam Machanic's sp_whoisactive. This wait persists for the full duration of the 40s attempt. It looks like at 40s the attempt is immediately abandoned and a new connection is created, which is consistent with the messages I see in the task information.
Additionally, when looking at the statements, I note that it's doing ROWS_PER_BATCH = 1000 regardless of the original number of rows. I can't see any way of changing that in the docs; I tried rowsPerBatch as a writer option on the DataFrame, but it didn't appear to make a difference - the value still shows as 1000.
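For reference, here's a sketch of that attempt. Whether this connector actually honours rowsPerBatch, or the generic Spark JDBC batchsize option (whose default of 1000 would at least match the ROWS_PER_BATCH value I'm seeing), is just a guess on my part:

# Sketch only: trying to influence the bulk-copy batch size.
# "rowsPerBatch" is the option name I guessed at; "batchsize" is the
# generic Spark JDBC writer option (default 1000). Whether the MS
# connector honours either is an assumption, not something I've confirmed.
df.write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("append") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", username) \
    .option("password", password) \
    .option("rowsPerBatch", 10000) \
    .option("batchsize", 10000) \
    .option("numPartitions", 1) \
    .save()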
I've been running this with lots of different row counts in foo. When the total is greater than 4,096, my testing suggests the Spark executor succeeds whenever it tries a number of records that exceeds 4,096. If I remove numPartitions, there are more attempts of exactly 4,096 records, and so I see more failures.
Weirdly, if I cancel a query that appears to have been running for longer than 10s and immediately retry it, it seems to succeed every time, as long as the number of rows in foo is not 4,096. My sample is obviously pretty small - tens of attempts.
Is there a limitation I'm not familiar with here? What's the magic of 4,096?
In discussing this with my friend, we're wondering whether there is some form of implicit type conversion happening in the arrays when they hold fewer than 4,096 records, which somehow causes delays.
I'm at quite a loss on this one, and I'm wondering whether I just need to check the length of the DataFrame before attempting the transfer: using an iterative cursor in pyodbc for smaller row counts (see the sketch below), and sticking to the JDBC connector for larger numbers of rows. It seems like that shouldn't be needed!
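Roughly what I have in mind for that workaround, purely as a sketch; the row threshold, ODBC driver name, and connection string are placeholders, and collecting a small DataFrame to the driver is assumed to be acceptable here:

import pyodbc

ROW_THRESHOLD = 4097  # hypothetical cut-off, just above the problem size

df = spark.table("foo")

if df.count() >= ROW_THRESHOLD:
    # Larger sets: keep using the MS Spark JDBC connector as above.
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("append") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .save()
else:
    # Smaller sets: collect to the driver and insert via pyodbc instead.
    rows = [tuple(r) for r in df.collect()]
    placeholders = ", ".join(["?"] * len(df.columns))
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=ip,port;DATABASE=bar;UID=user;PWD=password"
    )
    with conn:  # commits on successful exit
        cursor = conn.cursor()
        cursor.fast_executemany = True
        cursor.executemany(
            "INSERT INTO dbo.foo VALUES (" + placeholders + ")", rows
        )
    conn.close()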
Many thanks,
Johan

Related

Increase the lock timeout with SQLite, and what is the default value?

A well-known issue when many clients query an SQLite database: database is locked.
I would like to increase the delay (in ms) to wait for lock release on Linux, to get rid of this error.
From the sqlite command-line shell, I can use, for example (4 seconds):
sqlite> .timeout 4000
sqlite>
I've started many processes which do selects/inserts/deletes, and if I don't set this value from the sqlite shell, I sometimes get:
sqlite> select * from items where code like '%30';
Error: database is locked
sqlite>
So what is the default value for .timeout?
In Perl 5.10 programs I also sometimes get this error, even though the default value seems to be 30,000 (so 30 seconds, not documented).
Did the programs actually wait for 30 seconds before this error? If so, that seems crazy; there must be at least some moment where the database is free, even if many other processes are running against it.
my $dbh = DBI->connect($database,"","") or die "cannot connect $DBI::errstr";
my $to = $dbh->sqlite_busy_timeout(); # $to gets the value 30000
Thanks!
The default busy timeout for DBD::SQLite is defined in dbdimp.h as 30000 milliseconds. You can change it with $dbh->sqlite_busy_timeout($ms);.
The sqlite3 command-line shell has the normal SQLite default of 0; that is to say, no timeout. If the database is locked, it errors right away. You can change it with .timeout ms or pragma busy_timeout=ms;.
The timeout works as follows:
The handler will sleep multiple times until at least "ms" milliseconds of sleeping have accumulated. After at least "ms" milliseconds of sleeping, the handler returns 0 which causes sqlite3_step() to return SQLITE_BUSY.
If you get a busy database error even with a 30 second timeout, you just got unlucky as to when attempts to acquire a lock were made on a heavily used database file (or something is running a really slow query). You might look into WAL mode if not already using it.
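Purely as an illustration, using Python's built-in sqlite3 module and a placeholder file name rather than the Perl code above, setting the timeout per connection looks something like this:

import sqlite3

# Open the database with a 4-second busy timeout (same effect as
# ".timeout 4000" in the shell or "PRAGMA busy_timeout = 4000;").
conn = sqlite3.connect("items.db", timeout=4.0)

# The pragma can also be set explicitly on an existing connection.
conn.execute("PRAGMA busy_timeout = 4000")

# WAL mode lets readers coexist with a single writer, which greatly
# reduces "database is locked" errors under concurrent load.
conn.execute("PRAGMA journal_mode = WAL")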

Server out-of-memory issue when using RJDBC in a parallel computing environment

I have an R server with 16 cores and 8 GB of RAM that initializes a local SNOW cluster of, say, 10 workers. Each worker downloads a series of datasets from a Microsoft SQL server, merges them on some key, then runs analyses on the dataset before writing the results to the SQL server. The connection between the workers and the SQL server runs through an RJDBC connection. When multiple workers are getting data from the SQL server, RAM usage explodes and the R server crashes.
The strange thing is that the RAM usage by a worker loading in data seems disproportionately large compared to the size of the loaded dataset. Each dataset has about 8000 rows and 6500 columns. This translates to about 20 MB when saved as an R object on disk and about 160 MB when saved as a comma-delimited file. Yet the RAM usage of the R session is about 2.3 GB.
Here is an overview of the code (some typographical changes to improve readability):
Establish connection using RJDBC:
require("RJDBC")
drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver","sqljdbc4.jar")
con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>","<username>","<pass>")
After this there is some code that sorts the function's input vector requestedDataSets (containing the names of all tables to query) by number of records, so that we load the datasets from largest to smallest:
nrow.to.merge <- rep(0, length(requestedDataSets))
for (d in 1:length(requestedDataSets)) {
  nrow.to.merge[d] <- dbGetQuery(con, paste0("select count(*) from ", requestedDataSets[d]))[1, 1]
}
merge.order <- order(nrow.to.merge, decreasing = TRUE)
We then go through the requestedDataSets vector and load and/or merge the data:
for (d in merge.order) {
  # force reconnect to SQL server
  drv <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver", "sqljdbc4.jar")
  try(dbDisconnect(con), silent = TRUE)
  con <<- dbConnect(drv, "jdbc:sqlserver://<some.ip>", "<user>", "<pass>")

  # remove the to.merge object
  rm(complete.data.to.merge)

  # force garbage collection
  gc()
  jgc()

  # ask database for dataset d
  complete.data.to.merge <- dbGetQuery(con, paste0("select * from ", requestedDataSets[d]))

  if (d == merge.order[1]) {
    # first dataset
    complete.data <- complete.data.to.merge
    colnames(complete.data)[colnames(complete.data) == "key"] <- "key_1"
  } else {
    # later datasets
    complete.data <- merge(
      x = complete.data,
      y = complete.data.to.merge,
      by.x = "key_1", by.y = "key", all.x = TRUE)
  }
}
return(complete.data)
When I run this code on a series of twelve datasets, the number of rows/columns of the complete.data object is as expected, so it is unlikely the merge call somehow blows up the usage. For the twelve iterations memory.size() returns 1178, 1364, 1500, 1662, 1656, 1925, 1835, 1987, 2106, 2130, 2217, and 2361. Which, again, is strange, as the dataset at the end is at most 162 MB...
As you can see in the code above, I've already tried a couple of fixes like calling gc() and jgc() (a function to request a Java garbage collection: jgc <- function(){ .jcall("java/lang/System", method = "gc") }). I've also tried merging the data SQL-Server-side, but then I run into column count constraints.
It vexes me that the RAM usage is so much bigger than the dataset that is eventually created, leading me to believe there is some sort of buffer/heap that is overflowing... but I seem unable to find it.
Any advice on how to resolve this issue would be greatly appreciated. Let me know if (parts of) my problem description are vague or if you require more information.
Thanks.
This answer is more of a glorified comment. Simply because the data being processed on one node only requires 160 MB does not mean that the amount of memory needed to process it is 160 MB. Many algorithms require O(n^2) storage space, which would be in the GB range for your chunk of data. So I actually don't see anything here which is surprising.
I've already tried a couple of fixes like calling gc() and jgc() (which is a function to force a Java garbage collection...
You can't force a garbage collection in Java; calling System.gc() only politely asks the JVM to do a garbage collection, and it is free to ignore the request if it wants. In any case, the JVM usually optimizes garbage collection well on its own, and I doubt this is your bottleneck. More likely, you are simply hitting the overhead which R needs to crunch your data.

How can I debug problems with warehouse creation?

When trying to create a warehouse from the Cloudant dashboard, sometimes the process fails with an error dialog. Other times, the warehouse extraction stays in a 'triggered' state even after hours.
How can I debug this? For example is there an API I can call to see what is going on?
Take a look at the document inside the _warehouser database, and look for the warehouser_error_message element. For example:
"warehouser_error_message": "Exception occurred while creating table.
[SQL0670N The statement failed because the row size of the
resulting table would have exceeded the row size limit. Row size
limit: \"\". Table space name: \"\". Resulting row size: \"\".
com.ibm.db2.jcc.am.SqlException: DB2 SQL Error: SQLCODE=-670,
SQLSTATE=54010, SQLERRMC=32677;;34593, DRIVER=4.18.60]"
The warehouser error message usually gives you enough information to debug the problem.
You can view the _warehouser document in the Cloudant dashboard or use the API, e.g.
export cl_username='<your_cloudant_account>'
curl -s -u $cl_username \
  https://$cl_username.cloudant.com/_warehouser/_all_docs?include_docs=true \
  | jq '[.rows[].doc.warehouser_error_message]'
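If you would rather script this, a rough Python equivalent of the curl call could look like the following; the account name, password handling, and the decision to print only the error field are placeholders, not something from the Cloudant docs:

import requests

account = "<your_cloudant_account>"
password = "<your_password>"

# Fetch every document in the _warehouser database, including bodies.
resp = requests.get(
    f"https://{account}.cloudant.com/_warehouser/_all_docs",
    params={"include_docs": "true"},
    auth=(account, password),
)
resp.raise_for_status()

# Print the error message (if any) recorded for each warehouse document.
for row in resp.json().get("rows", []):
    doc = row.get("doc", {})
    if "warehouser_error_message" in doc:
        print(doc["_id"], doc["warehouser_error_message"])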

QODBC and SQL Server query performance

My application makes about 5 queries per second to a SQL Server database. Each query returns about 1500 rows on average. The application is written in C++/Qt, and database operations are implemented using the QODBC driver. I determined that query processing takes about 25 ms, but fetching the result takes about 800 ms. Here is what the code querying the database looks like:
QSqlQuery query(db);
query.prepare(queryStr);
query.setForwardOnly(true);
if (query.exec())
{
    while (query.next())
    {
        int v = query.value(0).toInt();
        .....
    }
}
How to optimize result fetching?
This does not directly answer your question, as I haven't used Qt in years. In the raw ODBC API you can often speed up the retrieval of rows by setting SQL_ATTR_ROW_ARRAY_SIZE to N; each call to SQLFetch then returns N rows at once. I took a look at QSqlQuery in Qt and could not see a way to do this, but it may be something you could look into in Qt, or you could simply write to the ODBC API directly. You can find an example at Preparing to Return Multiple Rows.

Find long running query on Informix?

How can you find out what the long-running queries are on an Informix database server? I have a query that is using up the CPU and want to find out what it is.
If the query is currently running, watch the onstat -g act -r 1 output and look for items with an rstcb that is not 0:
Running threads:
tid tcb rstcb prty status vp-class name
106 c0000000d4860950 0 2 running 107soc soctcppoll
107 c0000000d4881950 0 2 running 108soc soctcppoll
564457 c0000000d7f28250 c0000000d7afcf20 2 running 1cpu CDRD_10
In this example the third row is what is currently running. If you have multiple rows with non-zero rstcb values, then watch for a bit, looking for the one that is always or almost always there. That is most likely the session you're looking for.
c0000000d7afcf20 is the address that we're interested in for this example.
Use onstat -u | grep c0000000d7afcf20 to find the session
c0000000d7afcf20 Y--P--- 22887 informix - c0000000d5b0abd0 0 5 14060 3811
This gives you the session id, which in our example is 22887. Use onstat -g ses 22887 to list info about that session. In my example it's a system session, so there's nothing to see in the onstat -g ses output.
(Note: the SYSIBMADM.LONG_RUNNING_SQL query quoted at the end of this section is for DB2, not Informix.)
The sysmaster database (a virtual relational database of Informix shared memory) will probably contain the information you seek. These pages might help you get started:
http://docs.rinet.ru/InforSmes/ch22/ch22.htm
http://www.informix.com.ua/articles/sysmast/sysmast.htm
Okay it took me a bit to work out how to connect to sysmaster. The JDBC connection string is:
jdbc:informix-sqli://dbserver.local:1526/sysmaster:INFORMIXSERVER=mydatabase
Where the port number is the same as when you are connecting to the actual database. That is, if your connection string is:
jdbc:informix-sqli://database:1541/crm:INFORMIXSERVER=crmlive
Then the sysmaster connection string is:
jdbc:informix-sqli://database:1541/sysmaster:INFORMIXSERVER=crmlive
I also found this wiki page, which contains a number of SQL queries for operating on the sysmaster tables.
SELECT ELAPSED_TIME_MIN,SUBSTR(AUTHID,1,10) AS AUTH_ID,
AGENT_ID, APPL_STATUS,SUBSTR(STMT_TEXT,1,20) AS SQL_TEXT
FROM SYSIBMADM.LONG_RUNNING_SQL
WHERE ELAPSED_TIME_MIN > 0
ORDER BY ELAPSED_TIME_MIN DESC
Credit: SQL to View Long Running Queries
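Purely to illustrate using that sysmaster connection string programmatically, a rough Python sketch with the jaydebeapi package might look like this; the JAR path, host, port, credentials, and the syssessions column names are assumptions rather than verified details:

import jaydebeapi

# Placeholder connection details; adjust server, port, and INFORMIXSERVER.
url = "jdbc:informix-sqli://dbserver.local:1526/sysmaster:INFORMIXSERVER=mydatabase"

conn = jaydebeapi.connect(
    "com.informix.jdbc.IfxDriver",   # Informix JDBC driver class
    url,
    ["informix", "password"],        # placeholder user / password
    "ifxjdbc.jar",                   # path to the Informix JDBC driver JAR
)
try:
    cur = conn.cursor()
    # syssessions is one of the sysmaster pseudo-tables; the column
    # names used here (sid, username, hostname) are assumptions.
    cur.execute("SELECT sid, username, hostname FROM syssessions")
    for sid, username, hostname in cur.fetchall():
        print(sid, username, hostname)
finally:
    conn.close()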
