SQL Server R Services - outputting data to database table, performance

I noticed that the rx* functions (e.g. rxKmeans, rxDataStep) insert data into a SQL Server table in a row-by-row fashion when the outFile parameter is set to a table. This is obviously very slow, and something like a bulk insert would be desirable instead. Can this be achieved, and how?
Currently I am trying to insert about 14 million rows into a table by invoking the rxKmeans function with the outFile parameter specified, and it takes about 20 minutes.
Example of my code:
clustersLogInitialPD <- rxKmeans(formula = ~LogInitialPD
                                 ,data = inDataSource
                                 ,algorithm = "Lloyd"
                                 ,centers = start_c
                                 ,maxIterations = 1
                                 ,outFile = sqlLogPDClustersDS
                                 ,outColName = "ClusterNo"
                                 ,overwrite = TRUE
                                 ,writeModelVars = TRUE
                                 ,extraVarsToWrite = c("LoadsetId", "ExposureId")
                                 ,reportProgress = 0
                                 )
sqlLogPDClustersDS points to a table in my database.
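For reference, a data source like sqlLogPDClustersDS would typically be created with RxSqlServerData, roughly as below; the connection string, table name and rowsPerRead value are placeholders, not my actual settings:
sqlConnString <- "Driver=SQL Server;Server=.;Database=MyDb;Trusted_Connection=True"
sqlLogPDClustersDS <- RxSqlServerData(connectionString = sqlConnString
                                      ,table = "dbo.LogPDClusters"   # hypothetical target table
                                      ,rowsPerRead = 500000)         # chunk-size hint; it does not by itself make the write a bulk insert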
I am working on SQL Server 2016 SP1 with R Services installed and configured (both in-database and standalone). Generally everything works fine, except for this terrible performance when writing rows to database tables from R scripts.
Any comments will be greatly appreciated.

I brought this up on this Microsoft R MSDN forum thread recently as well.
I ran into this problem and I'm aware of two reasonable solutions.
1. Use the sp_execute_external_script output data frame option
/* Time writing data back to SQL from R */
SET STATISTICS TIME ON
IF object_id('tempdb..#tmp') IS NOT NULL
DROP TABLE #tmp
CREATE TABLE #tmp (a FLOAT NOT NULL, b INT NOT NULL );
DECLARE @numRows INT = 1000000
INSERT INTO #tmp (a, b)
EXECUTE sys.sp_execute_external_script
 @language = N'R'
,@script = N'OutputDataSet <- data.frame(a=rnorm(numRows), b=1)'
,@input_data_1 = N''
,@output_data_1_name = N'OutputDataSet'
,@params = N'@numRows INT'
,@numRows = @numRows
GO
-- ~7-8 seconds for 1 million row insert (2 columns) on my server
-- rxDataStep for 100K rows takes ~45 seconds on my server
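Applied to the original rxKmeans example, option (1) means dropping outFile, returning the scored rows from the R script as OutputDataSet, and letting INSERT ... EXECUTE do the write on the SQL side. A rough, untested sketch of what the @script body could look like (it assumes rxKmeans exposes per-row assignments in its cluster component when outFile is omitted, and that start_c is computed inside, or passed into, the script):
clusters <- rxKmeans(formula = ~LogInitialPD
                     ,data = InputDataSet        # supplied via @input_data_1
                     ,algorithm = "Lloyd"
                     ,centers = start_c
                     ,maxIterations = 1
                     ,reportProgress = 0)
# return the key columns plus the cluster assignment; INSERT ... EXECUTE persists them in the target table
OutputDataSet <- data.frame(InputDataSet[, c("LoadsetId", "ExposureId", "LogInitialPD")]
                            ,ClusterNo = clusters$cluster)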
2. Use SQL Server bcp.exe or BULK INSERT (only if running on the SQL box itself) after first writing a data frame to a flat file
I've written some code that does this, but it's not very polished, and I've had to leave sections with <<<VARIABLE>>> placeholders that assume connection string information (server, database, schema, login, password). If you find this useful or find any bugs, please let me know. I'd also love to see Microsoft incorporate the ability to save data from R back to SQL Server using the BCP APIs. Solution (1) above only works via sp_execute_external_script. Basic testing also leads me to believe that bcp.exe can be roughly twice as fast as option (1) for a million rows; BCP results in a minimally-logged SQL operation, so I'd expect it to be faster.
# Creates a bcp file format function needed to insert data into a table.
# This should be run one-off during code development to generate the format needed for a given task and saved in the .R file that uses it
createBcpFormatFile <- function(formatFileName, tableName) {
# Command to generate BCP file format for importing data into SQL Server
# https://msdn.microsoft.com/en-us/library/ms162802.aspx
# format creates a format file based on the option specified (-n, -c, -w, or -N) and the table or view delimiters. When bulk copying data, the bcp command can refer to a format file, which saves you from re-entering format information interactively. The format option requires the -f option; creating an XML format file, also requires the -x option. For more information, see Create a Format File (SQL Server). You must specify nul as the value (format nul).
# -c Performs the operation using a character data type. This option does not prompt for each field; it uses char as the storage type, without prefixes and with \t (tab character) as the field separator and \r\n (newline character) as the row terminator. -c is not compatible with -w.
# -x Used with the format and -f format_file options, generates an XML-based format file instead of the default non-XML format file. The -x does not work when importing or exporting data. It generates an error if used without both format and -f format_file.
## Bob: -x not used because we currently target bcp version 8 (default odbc driver compatibility that is installed everywhere)
# -f If -f is used with the format option, the specified format_file is created for the specified table or view. To create an XML format file, also specify the -x option. For more information, see Create a Format File (SQL Server).
# -t field_term Specifies the field terminator. The default is \t (tab character). Use this parameter to override the default field terminator. For more information, see Specify Field and Row Terminators (SQL Server).
# -S server_name [\instance_name] Specifies the instance of SQL Server to which to connect. If no server is specified, the bcp utility connects to the default instance of SQL Server on the local computer. This option is required when a bcp command is run from a remote computer on the network or a local named instance. To connect to the default instance of SQL Server on a server, specify only server_name. To connect to a named instance of SQL Server, specify server_name\instance_name.
# -U login_id Specifies the login ID used to connect to SQL Server.
# -P -P password Specifies the password for the login ID. If this option is not used, the bcp command prompts for a password. If this option is used at the end of the command prompt without a password, bcp uses the default password (NULL).
bcpPath <- .pathToBcpExe()
parsedTableName <- parseName(tableName)
# We can't use the -d option for BCP and instead need to fully qualify a table (database.schema.table)
# -d database_name Specifies the database to connect to. By default, bcp.exe connects to the user’s default database. If -d database_name and a three-part name (database_name.schema.table, passed as the first parameter to bcp.exe) is specified, an error will occur because you cannot specify the database name twice. If database_name begins with a hyphen (-) or a forward slash (/), do not add a space between -d and the database name.
fullyQualifiedTableName <- paste0(parsedTableName["dbName"], ".", parsedTableName["schemaName"], ".", parsedTableName["tableName"])
bcpOptions <- paste0("format nul -c -f ", formatFileName, " -t, ", .bcpConnectionOptions())
commandToRun <- paste0(bcpPath, " ", fullyQualifiedTableName, " ", bcpOptions)
result <- .bcpRunShellThrowErrors(commandToRun)
}
# Save a data frame (data) using file format (formatFilePath) to a table on the database (tableName)
bcpDataToTable <- function(data, formatFilePath, tableName) {
numRows <- nrow(data)
# write file to disk
ptm <- proc.time()
tmpFileName <- tempfile("bcp", tmpdir=getwd(), fileext=".csv")
write.table(data, file=tmpFileName, quote=FALSE, row.names=FALSE, col.names=FALSE, sep=",")
# Bob: note that one can make this significantly faster by switching over to use the readr package (readr::write_csv)
#readr::write_csv(data, tmpFileName, col_names=FALSE)
# bcp file to server time start
mid <- proc.time()
bcpPath <- .pathToBcpExe()
parsedTableName <- parseName(tableName)
# We can't use the -d option for BCP and instead need to fully qualify a table (database.schema.table)
# -d database_name Specifies the database to connect to. By default, bcp.exe connects to the user’s default database. If -d database_name and a three-part name (database_name.schema.table, passed as the first parameter to bcp.exe) is specified, an error will occur because you cannot specify the database name twice. If database_name begins with a hyphen (-) or a forward slash (/), do not add a space between -d and the database name.
fullyQualifiedTableName <- paste0(parsedTableName["dbName"], ".", parsedTableName["schemaName"], ".", parsedTableName["tableName"])
bcpOptions <- paste0(" in ", tmpFileName, " ", .bcpConnectionOptions(), " -f ", formatFilePath, " -h TABLOCK")
commandToRun <- paste0(bcpPath, " ", fullyQualifiedTableName, " ", bcpOptions)
result <- .bcpRunShellThrowErrors(commandToRun)
cat(paste0("time to save dataset to disk (", numRows, " rows):\n"))
print(mid - ptm)
cat(paste0("overall time (", numRows, " rows):\n"))
proc.time() - ptm
unlink(tmpFileName)
}
# Examples:
# createBcpFormatFile("test2.fmt", "temp_bob")
# data <- data.frame(x=sample(1:40, 1000, replace=TRUE))
# bcpDataToTable(data, "test2.fmt", "test_bcp_1")
#####################
#                   #
# Private functions #
#                   #
#####################
# Path to bcp.exe. bcp.exe is currently from version 8 (SQL 2000); newer versions depend on newer SQL Server ODBC drivers and are harder to distribute by copy/paste
.pathToBcpExe <- function() {
paste0(<<<bcpFolder>>>, "/bcp.exe")
}
# Function to convert warnings from shell into errors always
.bcpRunShellThrowErrors <- function(commandToRun) {
tryCatch({
shell(commandToRun)
}, warning=function(w) {
conditionMessageWithoutPassword <- gsub(<<<connectionStringSqlPassword>>>, "*****", conditionMessage(w), fixed=TRUE) # Do not print SQL passwords in errors
stop("Converted from warning: ", conditionMessageWithoutPassword)
})
}
# The connection options needed to establish a connection to the client database
.bcpConnectionOptions <- function() {
if (<<<useTrustedConnection>>>) {
return(paste0(" -S ", <<<databaseServer>>>, " -T"))
} else {
return(paste0(" -S ", <<<databaseServer>>>, " -U ", <<<connectionStringLogin>>>," -P ", <<<connectionStringSqlPassword>>>))
}
}
###################
# Other functions #
###################
# Mirrors SQL Server parseName function
parseName <- function(databaseObject) {
splitName <- strsplit(databaseObject, '.', fixed=TRUE)[[1]]
if (length(splitName)==3){
dbName <- splitName[1]
schemaName <- splitName[2]
tableName <- splitName[3]
} else if (length(splitName)==2){
dbName <- <<<databaseName>>>
schemaName <- splitName[1]
tableName <- splitName[2]
} else if (length(splitName)==1){
dbName <- <<<databaseName>>>
schemaName <- ""
tableName <- splitName[1]
}
return(c(tableName=tableName, schemaName=schemaName, dbName=dbName))
}

Related

Trying to Export Tables to CSVs from SQL Server

I ran the following script to try to get all tables in my DB exported (trying to back up the data as CSVs).
SELECT 'sqlcmd -S . -d '+DB_NAME()+' -E -s, -W -Q "SET NOCOUNT ON; SELECT * FROM '+table_schema+'.'+TABLE_name+'" > "C:\Temp\'+Table_Name+'.csv"'
FROM [INFORMATION_SCHEMA].[TABLES]
I saved the results as a batch file and ran the batch file as Administrator.
That runs without an error, but I get no data exported. All it does is create blank CSV files.
I ran this as well: EXEC sp_configure 'remote access', 1; RECONFIGURE;
Still, nothing is exported. CSVs are created, but no data is exported...
Any thoughts?
I ended up using R to do the task...
library("RODBC")
conn <- odbcDriverConnect('driver={SQL Server};server=Server_Name;database=DB_Name;trusted_connection=true')
data <- sqlQuery(conn, "SELECT * FROM DB.dbo.TBL#1")
write.csv(data,file=paste("C:/Users/TBL#1.csv",sep=""),row.names=FALSE)
data <- sqlQuery(conn, "SELECT * FROM DB.dbo.TBL#2")
write.csv(data,file=paste("C:/Users/TBL#2.csv",sep=""),row.names=FALSE)
Gotta love the IT teams in corporate America...especially when they lock down your system so tight, you need to come up with all kinds of weird hacks just so you can do the job that you were hired to do...
Is there a word for negative synergy?

How to create SQL Server table from dplyr pipeline

Due to a bug in dbplyr, copy_to and compute are currently not working for SQL Server connections.
connStr <- "driver=ODBC Driver 13 for SQL Server;server=localhost;..."
db <- DBI::dbConnect(odbc::odbc(), .connection_string=connStr)
copy_to(db, mtcars)
#Error: <SQL> 'CREATE TEMPORARY TABLE "mtcars" (
# "row_names" varchar(255),
# "mpg" FLOAT,
# ...
# nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Unknown object type 'TEMPORARY' used in a CREATE, DROP, or ALTER statement.
# use raw DBI functionality to create table
DBI::dbWriteTable(db, "mtcars", mtcars)
qry <- tbl(db, "mtcars") %>% group_by(am) %>% summarise(m=mean(mpg))
compute(qry)
#Error: <SQL> 'CREATE TEMPORARY TABLE "isrxofsskr" AS SELECT "am" AS "am", "m" #AS "m"
#FROM (SELECT "am", AVG("mpg") AS "m"
#FROM "mtcars"
#GROUP BY "am") "htrkkxabrn"'
# nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Unknown object type 'TEMPORARY' used in a CREATE, DROP, or ALTER statement.
There is an active PR on the dbplyr repo that solves this problem, but no indication of when this will be merged (or when it will reach CRAN). In the meantime, how would I create a table from the query, without reading the data into R?
It turns out that the PR on the dbplyr repo is glitched anyway, and will pull the entire table into memory before writing it back.
Fixing the problem requires creating a couple of MSSQL-specific methods for dbplyr generics. These are listed below. I've also posted them to the dbplyr repo so (assuming they work) they should hopefully be merged before too long.
#' @export
`db_compute.Microsoft SQL Server` <- function(con, table, sql, temporary=TRUE,
                                              unique_indexes=list(), indexes=list(), ...)
{
    # check that name has prefixed '##' if temporary
    if(temporary && substr(table, 1, 1) != "#")
        table <- paste0("##", table)
    if(!is.list(indexes))
        indexes <- as.list(indexes)
    if(!is.list(unique_indexes))
        unique_indexes <- as.list(unique_indexes)
    db_save_query(con, sql, table, temporary=temporary)
    db_create_indexes(con, table, unique_indexes, unique=TRUE)
    db_create_indexes(con, table, indexes, unique=FALSE)
    table
}
#' @export
`db_save_query.Microsoft SQL Server` <- function(con, sql, name, temporary=TRUE, ...)
{
    # check that name has prefixed '##' if temporary
    if(temporary && substr(name, 1, 1) != "#")
        name <- paste0("##", name)
    tt_sql <- build_sql("SELECT * INTO ", ident_q(name),
                        " FROM (", sql, ") ", ident_q(name), con=con)
    dbExecute(con, tt_sql)
    name
}
Note: may not be Bobby Tables-resistant. Testing is advised.
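Assuming the two methods are defined at the top level of the session so that dbplyr's S3 dispatch can find them, the original pipeline should then work; a rough sketch, with an arbitrary temp-table name:
library(dplyr)
qry <- tbl(db, "mtcars") %>% group_by(am) %>% summarise(m = mean(mpg))
# compute() should now materialise the query server-side as a global temp table (##mtcars_summary)
tmp <- compute(qry, name = "mtcars_summary")
tmp %>% collect()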

Issues using "-f" flag in CQLSH to run a query.cql file

I'm using cqlsh to add data to Cassandra with a BATCH query. I can load the data with a query using the "-e" flag, but not from a file using the "-f" flag. I think that's because the file is local and Cassandra is remote. Details below:
This is a sample of my query (there are more rows to insert, obviously):
BEGIN BATCH;
INSERT INTO keyspace.table (id, field1) VALUES ('1','value1');
INSERT INTO keyspace.table (id, field1) VALUES ('2','value2');
APPLY BATCH;
If I enter the query via the "-e" flag then it works no problem:
>cqlsh -e "BEGIN BATCH; INSERT INTO keyspace.table (id, field1) VALUES ('1','value1'); INSERT INTO keyspace.table (id, field1) VALUES ('2','value2'); APPLY BATCH;" -u username -p password -k keyspace 99.99.99.99
But if I save the query to a text file (query.cql) and call as below, I get the following output:
>cqlsh -f query.cql -u username -p password -k keyspace 99.99.99.99
Using 3 child processes
Starting copy of keyspace.table with columns ['id', 'field1'].
Processed: 0 rows; Rate: 0 rows/s; Avg. rate: 0 rows/s
0 rows imported from 0 files in 0.076 seconds (0 skipped).
Cassandra obviously accepts the command but doesn't read the file; I'm guessing that's because Cassandra is located on a remote server and the file is located locally. The Cassandra instance I'm using is a managed service shared with other users, so I don't have access to it to copy files into folders.
How do I run this query on a remote instance of Cassandra where I only have CLI access?
I want to be able to use another tool to build the query.cql file and have a batch job run the command with the "-f" flag but I can't work out how I'm going wrong.
You're executing a local cqlsh client so it should be able to access your local query.cql file.
Try removing the BEGIN BATCH and APPLY BATCH lines, leaving just the two INSERT statements in query.cql, and retry.
One other solution to insert data quickly is to provide a csv file and use the COPY command inside cqlsh. Read this blog post: http://www.datastax.com/dev/blog/new-features-in-cqlsh-copy
Scripting the inserts by generating one cqlsh -e '...' call per line is feasible, but it will be horribly slow.

Percona's pt-table-sync: how to run on more than one table?

In the command line, this will successfully update table1:
pt-table-sync --execute h=host1,D=db1,t=table1 h=host2,D=db2
However if I want to update more than one table, I'm not sure how to write it. This only updates table1 as well and ignores the other tables:
pt-table-sync --execute h=host1,D=db1,t=table1,table2,table3 h=host2,D=db2
And this gives me an error:
pt-table-sync --execute h=host1,D=db1 --tables table1,table2,table3 h=host2,D=db2
Does anyone have an example of how to list the '--tables'... so that it successfully updates all the tables in the list?
The --tables option seems to be incompatible with the DSN notation; you get this error:
You specified a database but not a table in h=localhost,D=test.
Are you trying to sync only tables in the 'test' database?
If so, use '--databases test' instead.
As suggested in that error message, you can use --databases and then you can use --tables successfully.
For example, I created tables test.foo and test.bar, filled each with three rows, then deleted the rows from test.bar on the second server dewey.
I ran this:
$ pt-table-sync h=huey h=dewey --databases test --tables foo,bar --execute --verbose
# Syncing h=dewey
# DELETE REPLACE INSERT UPDATE ALGORITHM START END EXIT DATABASE.TABLE
# 0 0 3 0 Chunk 15:26:15 15:26:15 2 test.bar
# 0 0 0 0 Chunk 15:26:15 15:26:15 0 test.foo
It successfully re-inserted the 3 missing rows in test.bar.
Other tables in my test database were ignored.
This is an old question, but I searched everywhere for an answer. pt-table-sync only does one table. There is no tool that does the same thing to a list of tables or a full database schema. Specifically I want to run a Live server and be able to sync back to a Staging server, then edit code and files in the Staging server without fear of messing up Live or being overwritten by Live... and I want it to be free :)
I ended up writing a shell script called mysql_sync_live_to_stage.sh as follows:
#!/bin/bash
# sync db live to staging
error_log_file='./mysql_sync_errors.log'
echo $(date +"%Y %m %d %H:%M") > $error_log_file
function sync_table()
{
    pt-table-sync --no-foreign-key-checks --execute \
        h=DB_1_HOST,u=DB_1_USER,p=DB_1_PASSWORD,D=$1,t=$3 \
        h=DB_2_HOST,u=DB_2_USER,p=DB_2_PASSWORD,D=$2,t=$3 >> $error_log_file
}
# SYNC ALL TABLES IN name_of_live_database
mysql -h "DB_1_HOST" -u "DB_1_USER" -pDB_1_PASSWORD -D "DB_1_DBNAME" -e "SHOW TABLES" |
egrep -i '[0-9a-z\-\_]+' | egrep -i -v 'Tables_in' | while read -r table ; do
echo "Processing $table"
sync_table "name_of_live_database" "name_of_staging_database" $table
done
# FIX Config Settings For Staging
echo "Cleanup Queries..."
mysql -h "DB_2_HOST" -u "DB_2_USER" -pDB_2_PASSWORD -D "DB_2_DBNAME"
-e "UPDATE name_of_staging_database.nameofmyconfigtable SET value='bar'
WHERE config_id='foo'"
mysql -h "DB_2_HOST" -u "DB_2_USER" -pDB_2_PASSWORD -D "DB_2_DBNAME"
-e "UPDATE name_of_staging_database.nameofmyconfigtable SET value='bar2'
WHERE config_id='foo2'"
echo "Done"
This reads a list of table names from the live site then executes a sync on each one via the do loop. It goes through the list alphabetically, so I recommend keeping the --no-foreign-key-checks flag.
It's not perfect... It won't sync tables that don't exist in both databases, but when combined with a "git pull -f origin master" I get a complete sync in a couple of minutes.

Using COPY FROM stdin to load tables, reading input file only once

I've got a large (~60 million row) fixed width source file with ~1800 records per row.
I need to load this file into 5 different tables on an instance of Postgres 8.3.9.
My dilemma is that, because the file is so large, I'd like to have to read it only once.
This is straightforward enough using INSERT or COPY as normal, but I'm trying to get a load speed boost by including my COPY FROM statements in a transaction that includes a TRUNCATE--avoiding logging, which is supposed to give a considerable load speed boost (according to http://www.cirrusql.com/node/3). As I understand it, you can disable logging in Postgres 9.x--but I don't have that option on 8.3.9.
The script below has me reading the input file twice, which I want to avoid... any ideas on how I could accomplish this by reading the input file only once? Doesn't have to be bash--I also tried using psycopg2, but couldn't figure out how to stream file output into the COPY statement as I'm doing below. I can't COPY FROM file because I need to parse it on the fly.
#!/bin/bash
table1="copytest1"
table2="copytest2"
#note: $1 refers to the first argument used when invoking this script
#which should be the location of the file one wishes to have python
#parse and stream out into psql to be copied into the data tables
( echo 'BEGIN;'
echo 'TRUNCATE TABLE ' ${table1} ';'
echo 'COPY ' ${table1} ' FROM STDIN'
echo "WITH NULL AS '';"
cat $1 | python2.5 ~/parse_${table1}.py
echo '\.'
echo 'TRUNCATE TABLE ' ${table2} ';'
echo 'COPY ' ${table2} ' FROM STDIN'
echo "WITH NULL AS '';"
cat $1 | python2.5 ~/parse_${table2}.py
echo '\.'
echo 'COMMIT;'
) | psql -U postgres -h chewy.somehost.com -p 5473 -d db_name
exit 0
Thanks!
You could use named pipes instead of your anonymous pipe.
With this approach, your Python script can fill the tables through separate psql processes, sending each one the corresponding data.
Create pipes:
mkfifo fifo_table1
mkfifo fifo_table2
Run psql instances:
psql db_name < fifo_table1 &
psql db_name < fifo_table2 &
Your Python script would then look roughly like this (pseudocode):
SQL_BEGIN = """
BEGIN;
TRUNCATE TABLE %s;
COPY %s FROM STDIN WITH NULL AS '';
"""
fifo1 = open('fifo_table1', 'w')
fifo2 = open('fifo_table2', 'w')
bigfile = open('mybigfile', 'r')
print >> fifo1, SQL_BEGIN % ('table1', 'table1')  # ugly; with Python 2.6 you could use .format() syntax
print >> fifo2, SQL_BEGIN % ('table2', 'table2')
for line in bigfile:
    # your code, which decides which table the data belongs to
    # if data belongs to table1
    print >> fifo1, data
    # else
    print >> fifo2, data
print >> fifo1, '\\.'  # terminate the COPY data stream before committing
print >> fifo2, '\\.'
print >> fifo1, 'COMMIT;'
print >> fifo2, 'COMMIT;'
fifo1.close()
fifo2.close()
Maybe this is not the most elegant solution, but it should work.
Why use COPY for the second table? I would assume that doing a:
INSERT INTO table2 (...)
SELECT ...
FROM table1;
would be faster than using COPY.
Edit
If you need to import different rows into different tables but from the same source file, maybe inserting everything into a staging table and then inserting the rows from there into the target tables is faster:
Import the whole text file into one staging table:
COPY staging_table FROM STDIN ...;
After that step, the whole input file is in staging_table
Then copy the rows from the staging table to the individual target tables by selecting only those that qualify for the corresponding table:
INSERT INTO table_1 (...)
SELECT ...
FROM staging_table
WHERE (conditions for table_1);
INSERT INTO table_2 (...)
SELECT ...
FROM staging_table
WHERE (conditions for table_2);
This is of course only feasible if you have enough space in your database to keep the staging table around.
