How to create a SQL Server table from a dplyr pipeline

Due to a bug in dbplyr, copy_to and compute are currently not working for SQL Server connections.
connStr <- "driver=ODBC Driver 13 for SQL Server;server=localhost;..."
db <- DBI::dbConnect(odbc::odbc(), .connection_string=connStr)
copy_to(db, mtcars)
#Error: <SQL> 'CREATE TEMPORARY TABLE "mtcars" (
# "row_names" varchar(255),
# "mpg" FLOAT,
# ...
# nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Unknown object type 'TEMPORARY' used in a CREATE, DROP, or ALTER statement.
# use raw DBI functionality to create table
DBI::dbWriteTable(db, "mtcars", mtcars)
qry <- tbl(db, "mtcars") %>% group_by(am) %>% summarise(m=mean(mpg))
compute(qry)
#Error: <SQL> 'CREATE TEMPORARY TABLE "isrxofsskr" AS SELECT "am" AS "am", "m" AS "m"
#FROM (SELECT "am", AVG("mpg") AS "m"
#FROM "mtcars"
#GROUP BY "am") "htrkkxabrn"'
# nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC Driver 13 for SQL Server][SQL Server]Unknown object type 'TEMPORARY' used in a CREATE, DROP, or ALTER statement.
There is an active PR on the dbplyr repo that solves this problem, but no indication of when this will be merged (or when it will reach CRAN). In the meantime, how would I create a table from the query, without reading the data into R?

It turns out that the PR on the dbplyr repo is glitched anyway, and will pull the entire table into memory before writing it back.
Fixing the problem requires creating a couple of MSSQL-specific methods for dbplyr generics. These are listed below. I've also posted them to the dbplyr repo so (assuming they work) they should hopefully be merged before too long.
#' @export
`db_compute.Microsoft SQL Server` <- function(con, table, sql, temporary=TRUE,
                                              unique_indexes=list(), indexes=list(), ...)
{
    # SQL Server temp table names must begin with '#'; use '##' (a global
    # temp table) so the table remains visible to subsequent queries
    if(temporary && substr(table, 1, 1) != "#")
        table <- paste0("##", table)
    if(!is.list(indexes))
        indexes <- as.list(indexes)
    if(!is.list(unique_indexes))
        unique_indexes <- as.list(unique_indexes)
    db_save_query(con, sql, table, temporary=temporary)
    db_create_indexes(con, table, unique_indexes, unique=TRUE)
    db_create_indexes(con, table, indexes, unique=FALSE)
    table
}

#' @export
`db_save_query.Microsoft SQL Server` <- function(con, sql, name, temporary=TRUE, ...)
{
    # prefix the name with '##' if temporary, as above
    if(temporary && substr(name, 1, 1) != "#")
        name <- paste0("##", name)
    # SQL Server uses SELECT ... INTO rather than CREATE TABLE ... AS
    tt_sql <- build_sql("SELECT * INTO ", ident_q(name),
                        " FROM (", sql, ") ", ident_q(name), con=con)
    dbExecute(con, tt_sql)
    name
}
Note: may not be Bobby Tables-resistant. Testing is advised.
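With these methods defined, the original pipeline should work without pulling the data into R. A quick check (a sketch against the dbplyr version of the time; untested on later releases):

qry <- tbl(db, "mtcars") %>% group_by(am) %>% summarise(m=mean(mpg))
tmp <- compute(qry)   # now runs SELECT ... INTO ##<generated name> on the server
head(tmp)             # queries the materialised temp table lazily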

Related

Pandas dataframe insert into SQL Server taking too long with execute and executemany

I have a pandas dataframe with 27 columns and ~45k rows that I need to insert into a SQL Server table.
I am currently using the code below, and it takes 90 minutes to insert:
conn = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};\
                       Server=servername;\
                       Database=dbtest;\
                       Trusted_Connection=yes;')
cursor = conn.cursor()  # create cursor
for index, row in t6.iterrows():
    cursor.execute("insert into dbtest.dbo.test(col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12, col13, col14, ..., col27) \
                    values (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
                   row['col1'], row['col2'], row['col3'], ..., row['col27'])
I have also tried to load using executemany, and that takes even longer to complete, at nearly 120 minutes.
I am really looking for a faster load time since I need to run this daily.
You can set fast_executemany in pyodbc itself for versions >= 4.0.19. It is off by default.
import pyodbc
server_name = 'localhost'
database_name = 'AdventureWorks2019'
table_name = 'MyTable'
driver = 'ODBC Driver 17 for SQL Server'
connection = pyodbc.connect(driver='{'+driver+'}', server=server_name, database=database_name, trusted_connection='yes')
cursor = connection.cursor()
cursor.fast_executemany = True # reduce number of calls to server on inserts
# form SQL statement
columns = ", ".join(df.columns)
values = '('+', '.join(['?']*len(df.columns))+')'
statement = "INSERT INTO "+table_name+" ("+columns+") VALUES "+values
# extract values from DataFrame into list of tuples
insert = [tuple(x) for x in df.values]
cursor.executemany(statement, insert)
Or, if you prefer, use sqlalchemy and dataframes directly.
import sqlalchemy as db
engine = db.create_engine('mssql+pyodbc://@'+server_name+'/'+database_name+'?trusted_connection=yes&driver='+driver, fast_executemany=True)
df.to_sql(table_name, engine, if_exists='append', index=False)
See fast_executemany in this link: https://github.com/mkleehammer/pyodbc/wiki/Features-beyond-the-DB-API
I have worked through this in the past, and this was the fastest that I could get it to work using sqlalchemy.
import sqlalchemy as sa
engine = sa.create_engine(f'mssql://@{server}/{database}?trusted_connection=yes&driver={driver_name}',
                          fast_executemany=True)  # Windows authentication
df.to_sql('Daily_Report', con=engine, if_exists='append', index=False)
If the engine is not working for you, then you may have a different setup so please see: https://docs.sqlalchemy.org/en/13/core/engines.html
You should be able to create the variables needed above, but here is how I get the driver:
driver_name = ''
driver_names = [x for x in pyodbc.drivers() if x.endswith(' for SQL Server')]
if driver_names:
    # you may need to change [-1] to [-2] or another option in driver_names if the wrong driver is picked
    driver_name = driver_names[-1]
if driver_name:
    conn_str = f'''DRIVER={driver_name};SERVER='''
else:
    print('(No suitable driver found. Cannot connect.)')
You can try the 'multi' method built into pandas to_sql.
df.to_sql('table_name', con=engine, if_exists='replace', index=False, method='multi')
The 'multi' method lets you "pass multiple values in a single INSERT clause", per the documentation.
I found it to be pretty efficient.

SQL Server R Services - outputting data to database table, performance

I noticed that rx* functions (e.g. rxKmeans, rxDataStep) insert data into a SQL Server table in a row-by-row fashion when the outFile parameter is set to a table. This is obviously very slow, and something like a bulk insert would be desirable instead. Can this be achieved, and how?
Currently I am trying to insert about 14 million rows into a table by invoking the rxKmeans function with the outFile parameter specified, and it takes about 20 minutes.
Example of my code:
clustersLogInitialPD <- rxKmeans(formula = ~LogInitialPD
,data = inDataSource
,algorithm = "Lloyd"
,centers = start_c
,maxIterations = 1
,outFile = sqlLogPDClustersDS
,outColName = "ClusterNo"
,overwrite = TRUE
,writeModelVars = TRUE
,extraVarsToWrite = c("LoadsetId", "ExposureId")
,reportProgress = 0
)
sqlLogPDClustersDS points to a table in my database.
I am working on SQL Server 2016 SP1 with R Services installed and configured (both in-database and standalone). Generally everything works fine, except for this terrible performance when writing rows to database tables from an R script.
Any comments will be greatly appreciated.
I brought this up on this Microsoft R MSDN forum thread recently as well.
I ran into this problem and I'm aware of two reasonable solutions.
1. Use the sp_execute_external_script output data frame option
/* Time writing data back to SQL from R */
SET STATISTICS TIME ON

IF object_id('tempdb..#tmp') IS NOT NULL
    DROP TABLE #tmp
CREATE TABLE #tmp (a FLOAT NOT NULL, b INT NOT NULL);

DECLARE @numRows INT = 1000000

INSERT INTO #tmp (a, b)
EXECUTE sys.sp_execute_external_script
     @language = N'R'
    ,@script = N'OutputDataSet <- data.frame(a=rnorm(numRows), b=1)'
    ,@input_data_1 = N''
    ,@output_data_1_name = N'OutputDataSet'
    ,@params = N'@numRows INT'
    ,@numRows = @numRows
GO
-- ~7-8 seconds for 1 million row insert (2 columns) on my server
-- rxDataStep for 100K rows takes ~45 seconds on my server
2. Use SQL Server bcp.exe or BULK INSERT (only if running on the SQL box itself) after first writing the data frame to a flat file
I've written some code that does this, but it's not very polished, and I've had to leave sections with <<<VARIABLE>>> placeholders that assume connection string information (server, database, schema, login, password). If you find this useful, or find any bugs, please let me know. I'd also love to see Microsoft incorporate the ability to save data from R back to SQL Server using the BCP APIs. Solution (1) above only works via sp_execute_external_script. Basic testing also leads me to believe that bcp.exe can be roughly twice as fast as option (1) for a million rows. BCP results in a minimally-logged SQL operation, so I'd expect it to be faster.
# Creates a bcp format file, needed to insert data into a table.
# This should be run one-off during code development to generate the format
# needed for a given task, and saved with the .R file that uses it.
createBcpFormatFile <- function(formatFileName, tableName) {
    # Command to generate a BCP format file for importing data into SQL Server
    # https://msdn.microsoft.com/en-us/library/ms162802.aspx
    # format: creates a format file based on the option specified (-n, -c, -w, or -N) and the table or view delimiters. When bulk copying data, the bcp command can refer to a format file, which saves you from re-entering format information interactively. The format option requires the -f option; creating an XML format file also requires the -x option. For more information, see Create a Format File (SQL Server). You must specify nul as the value (format nul).
    # -c Performs the operation using a character data type. This option does not prompt for each field; it uses char as the storage type, without prefixes and with \t (tab character) as the field separator and \r\n (newline character) as the row terminator. -c is not compatible with -w.
    # -x Used with the format and -f format_file options, generates an XML-based format file instead of the default non-XML format file. The -x does not work when importing or exporting data. It generates an error if used without both format and -f format_file.
    ## Bob: -x not used because we currently target bcp version 8 (default odbc driver compatibility that is installed everywhere)
    # -f If -f is used with the format option, the specified format_file is created for the specified table or view. To create an XML format file, also specify the -x option. For more information, see Create a Format File (SQL Server).
    # -t field_term Specifies the field terminator. The default is \t (tab character). Use this parameter to override the default field terminator. For more information, see Specify Field and Row Terminators (SQL Server).
    # -S server_name[\instance_name] Specifies the instance of SQL Server to which to connect. If no server is specified, the bcp utility connects to the default instance of SQL Server on the local computer. This option is required when a bcp command is run from a remote computer on the network or a local named instance. To connect to the default instance of SQL Server on a server, specify only server_name. To connect to a named instance of SQL Server, specify server_name\instance_name.
    # -U login_id Specifies the login ID used to connect to SQL Server.
    # -P password Specifies the password for the login ID. If this option is not used, the bcp command prompts for a password. If this option is used at the end of the command prompt without a password, bcp uses the default password (NULL).
    bcpPath <- .pathToBcpExe()
    parsedTableName <- parseName(tableName)
    # We can't use the -d option for BCP and instead need to fully qualify the table (database.schema.table)
    # -d database_name Specifies the database to connect to. By default, bcp.exe connects to the user's default database. If -d database_name and a three-part name (database_name.schema.table, passed as the first parameter to bcp.exe) are specified, an error will occur because you cannot specify the database name twice. If database_name begins with a hyphen (-) or a forward slash (/), do not add a space between -d and the database name.
    fullyQualifiedTableName <- paste0(parsedTableName["dbName"], ".", parsedTableName["schemaName"], ".", parsedTableName["tableName"])
    bcpOptions <- paste0("format nul -c -f ", formatFileName, " -t, ", .bcpConnectionOptions())
    commandToRun <- paste0(bcpPath, " ", fullyQualifiedTableName, " ", bcpOptions)
    result <- .bcpRunShellThrowErrors(commandToRun)
}
# Save a data frame (data) using file format (formatFilePath) to a table on the database (tableName)
bcpDataToTable <- function(data, formatFilePath, tableName) {
numRows <- nrow(data)
# write file to disk
ptm <- proc.time()
tmpFileName <- tempfile("bcp", tmpdir=getwd(), fileext=".csv")
write.table(data, file=tmpFileName, quote=FALSE, row.names=FALSE, col.names=FALSE, sep=",")
# Bob: note that one can make this significantly faster by switching over to use the readr package (readr::write_csv)
#readr::write_csv(data, tmpFileName, col_names=FALSE)
# bcp file to server time start
mid <- proc.time()
bcpPath <- .pathToBcpExe()
parsedTableName <- parseName(tableName)
# We can't use the -d option for BCP and instead need to fully qualify a table (database.schema.table)
# -d database_name Specifies the database to connect to. By default, bcp.exe connects to the user’s default database. If -d database_name and a three part name (database_name.schema.table, passed as the first parameter to bcp.exe) is specified, an error will occur because you cannot specify the database name twice.If database_name begins with a hyphen (-) or a forward slash (/), do not add a space between -d and the database name.
fullyQualifiedTableName <- paste0(parsedTableName["dbName"], ".", parsedTableName["schemaName"], ".", parsedTableName["tableName"])
bcpOptions <- paste0(" in ", tmpFileName, " ", .bcpConnectionOptions(), " -f ", formatFilePath, " -h TABLOCK")
commandToRun <- paste0(bcpPath, " ", fullyQualifiedTableName, " ", bcpOptions)
result <- .bcpRunShellThrowErrors(commandToRun)
cat(paste0("time to save dataset to disk (", numRows, " rows):\n"))
print(mid - ptm)
cat(paste0("overall time (", numRows, " rows):\n"))
proc.time() - ptm
unlink(tmpFileName)
}
# Examples:
# createBcpFormatFile("test2.fmt", "temp_bob")
# data <- data.frame(x=sample(1:40, 1000, replace=TRUE))
# bcpDataToTable(data, "test2.fmt", "test_bcp_1")
#####################
# #
# Private functions #
# #
#####################
# Path to bcp.exe. bcp.exe is currently from version 8 (SQL 2000); newer versions depend on newer SQL Server ODBC drivers and are harder to copy/paste distribute
.pathToBcpExe <- function() {
    paste0(<<<bcpFolder>>>, "/bcp.exe")
}

# Convert warnings from shell into errors, always
.bcpRunShellThrowErrors <- function(commandToRun) {
    tryCatch({
        shell(commandToRun)
    }, warning=function(w) {
        conditionMessageWithoutPassword <- gsub(<<<connectionStringSqlPassword>>>, "*****", conditionMessage(w), fixed=TRUE) # do not print SQL passwords in errors
        stop("Converted from warning: ", conditionMessageWithoutPassword)
    })
}

# The connection options needed to establish a connection to the client database
.bcpConnectionOptions <- function() {
    if (<<<useTrustedConnection>>>) {
        return(paste0(" -S ", <<<databaseServer>>>, " -T"))
    } else {
        return(paste0(" -S ", <<<databaseServer>>>, " -U ", <<<connectionStringLogin>>>, " -P ", <<<connectionStringSqlPassword>>>))
    }
}
###################
# Other functions #
###################
# Mirrors SQL Server's PARSENAME function
parseName <- function(databaseObject) {
    splitName <- strsplit(databaseObject, '.', fixed=TRUE)[[1]]
    if (length(splitName) == 3) {
        dbName <- splitName[1]
        schemaName <- splitName[2]
        tableName <- splitName[3]
    } else if (length(splitName) == 2) {
        dbName <- <<<databaseName>>>   # two-part name: fall back to the default database
        schemaName <- splitName[1]
        tableName <- splitName[2]
    } else if (length(splitName) == 1) {
        dbName <- <<<databaseName>>>
        schemaName <- ""
        tableName <- splitName[1]
    }
    return(c(tableName=tableName, schemaName=schemaName, dbName=dbName))
}
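For example (with an illustrative three-part name):

parseName("AdventureWorks.dbo.FactSales")
#       tableName      schemaName          dbName
#     "FactSales"           "dbo" "AdventureWorks"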

RODBC::sqlSave - problems creating/appending to a table

Related to several other questions on the RODBC package, I'm having problems using RODBC::sqlSave to write to a table on a SQL Server database. I'm using MS SQL Server 2008 and 64-bit R on a Windows RDP.
The solution in the third linked question does work [sqlSave(ch, df)]. But in this case, it writes to the wrong database. That is, my default DB is "C2G" but I want to write to "BI_Sandbox". And it doesn't allow for options such as rownames, etc. So there still seems to be a problem in the package.
Obviously, a possible solution would be to change my ODBC connection to point at the specified database, but it seems there should be a better method. And this wouldn't solve the problem of unusable parameters in the sqlSave command, such as rownames, varTypes, etc.
I have the following ODBC System DSN connection:
Microsoft SQL Server Native Client Version 11.00.3000
Data Source Name: c2g
Data Source Description: c2g
Server: DC01-WIN-SQLEDW\BISQL01,29537
Use Integrated Security: Yes
Database: C2G
Language: (Default)
Data Encryption: No
Trust Server Certificate: No
Multiple Active Result Sets(MARS): No
Mirror Server:
Translate Character Data: Yes
Log Long Running Queries: No
Log Driver Statistics: No
Use Regional Settings: No
Use ANSI Quoted Identifiers: Yes
Use ANSI Null, Paddings and Warnings: Yes
R code:
R> ch <- odbcConnect("c2g")
R> sqlSave(ch, zinq_scores, tablename = "[bi_sandbox].[dbo].[table1]",
append= FALSE, rownames= FALSE, colnames= FALSE)
Error in sqlColumns(channel, tablename) :
‘[bi_sandbox].[dbo].[table1]’: table not found on channel
# after error, try again:
R> sqlDrop(ch, "[bi_sandbox].[dbo].[table1]", errors = FALSE)
R> sqlSave(ch, zinq_scores, tablename = "[bi_sandbox].[dbo].[table1]",
append= FALSE, rownames= FALSE, colnames= FALSE)
Error in sqlSave(ch, zinq_scores, tablename = "[bi_sandbox].[dbo].[table1]", :
42S01 2714 [Microsoft][SQL Server Native Client 11.0][SQL Server]There is already an object named 'table1' in the database.
[RODBC] ERROR: Could not SQLExecDirect 'CREATE TABLE [bi_sandbox].[dbo].[table1] ("credibility_review" float, "creditbuilder" float, "no_product" float, "duns" varchar(255), "pos_credrev" varchar(5), "pos_credbuild" varchar(5))'
In the past, I've gotten around this by running the supremely inefficient sqlQuery with a row-by-row INSERT INTO. But I tried that this time and no data was written, although the sqlQuery statement gave no error or warning message.
temp <-"INSERT INTO [bi_sandbox].[dbo].[table1]
+ (credibility_review, creditbuilder, no_product, duns, pos_credrev, pos_credbuild) VALUES ("
>
> for(i in 1:nrow(zinq_scores)) {
+ sqlQuery(ch, paste(temp, "'", zinq_scores[i, 1], "'",",", " ",
+ "'", zinq_scores[i, 2], "'", ",",
+ "'", zinq_scores[i, 3], "'", ",",
+ "'", zinq_scores[i, 4], "'", ",",
+ "'", zinq_scores[i, 5], "'", ",",
+ "'", zinq_scores[i, 6], "'", ")"))
+ }
> str(sqlQuery(ch, "select * from [bi_sandbox].[dbo].[table1]"))
'data.frame': 0 obs. of 6 variables:
$ credibility_review: chr
$ creditbuilder : chr
$ no_product : chr
$ duns : chr
$ pos_credrev : chr
$ pos_credbuild : chr
Any help would be greatly appreciated.
Also, if there is any missing detail, please let me know and I'll edit the question.
My apologies up front. This is not exactly a "simple example." It's pretty trivial, but there are a lot of parts. And by the end, you'll probably think I'm crazy for doing it this way.
Starting in SQL Server Management Studio
First, I've created a database on SQL Server called mtcars with default schema dbo. I've also added myself as a user. Under my own user name, I am the database owner, so I can do anything I want to the database, but from R, I will connect using a generic account that only has EXECUTE privileges.
The predefined table in the database that we are going to write to is called mtcars. (So the full path to the table is mtcars.dbo.mtcars; it's lazy, I know). The code to define the table is
USE [mtcars]
GO
/****** Object: Table [dbo].[mtcars] Script Date: 2/22/2016 11:56:53 AM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[mtcars](
    [OID] [int] IDENTITY(1,1) NOT NULL,
    [mpg] [numeric](18, 0) NULL,
    [cyl] [numeric](18, 0) NULL,
    [disp] [numeric](18, 0) NULL,
    [hp] [numeric](18, 0) NULL
) ON [PRIMARY]
GO
Stored Procedures
I'm going to use two stored procedures. The first is an "UPSERT" procedure, that will first try to update a row in a table. If that fails, it will insert the row into the table.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE dbo.sample_procedure
    @OID int = 0,
    @mpg numeric(18,0) = 0,
    @cyl numeric(18,0) = 0,
    @disp numeric(18,0) = 0,
    @hp numeric(18,0) = 0
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;
    -- TRANSACTION code borrowed from
    -- http://stackoverflow.com/a/21209131/1017276
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRANSACTION;
    UPDATE dbo.mtcars
    SET mpg = @mpg,
        cyl = @cyl,
        disp = @disp,
        hp = @hp
    WHERE OID = @OID;
    IF @@ROWCOUNT = 0
    BEGIN
        INSERT dbo.mtcars (mpg, cyl, disp, hp)
        VALUES (@mpg, @cyl, @disp, @hp)
    END
    COMMIT TRANSACTION;
END
GO
Another stored procedure I will use is just the equivalent of RODBC::sqlFetch. As far as I can tell, sqlFetch depends on SQL injection, and I'm not allowed to use it. Just to be on the safe side of our data security policies, I write little procedures like this. (Data security is pretty tight here; you may or may not need this.)
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE dbo.get_mtcars
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;
    SELECT * FROM dbo.mtcars
END
GO
Now, from R
I have a utility function I use to help me manage inputting data into the stored procedures. sqlSave would do a lot of this automatically, so I'm kind of reinventing the wheel. The gist of the utility function is to determine if the value I'm pushing to the database needs to be nested in quotes or not.
#* Utility function. This does a couple helpful things like
#* Convert NA and NULL into a SQL NULL
#* wrap character strings and dates in single quotes
sqlNullString <- function(value, numeric=FALSE)
{
    if (is.null(value)) value <- "NULL"
    if (is.na(value)) value <- "NULL"
    if (inherits(value, "Date")) value <- format(x = value, format = "%Y-%m-%d")

    if (value == "NULL") return(value)
    else if (numeric) return(value)
    else return(paste0("'", value, "'"))
}
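For example:

sqlNullString(NA)                  # "NULL"  (unquoted SQL null)
sqlNullString("green")             # "'green'"
sqlNullString(42, numeric = TRUE)  # 42 (numbers pass through unquoted)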
This next step isn't strictly necessary, but I'm going to do it just so that my R table is similar to my SQL table. This is organizational strategy on my part.
mtcars$OID <- NA
Now let's establish our connection:
server <- "[server_name]"
uid <- "[generic_user_name]"
pwd <- "[password]"
library(RODBC)
channel <- odbcDriverConnect(paste0("driver=SQL Server;",
                                    "server=", server, ";",
                                    "database=mtcars;",
                                    "uid=", uid, ";",
                                    "pwd=", pwd))
Now this next part is pure laziness. I'm going to use a for loop to push each row of the data frame to the SQL table one at a time. As noted in the original question, this is kind of inefficient. I'm sure I could write a stored procedure to accept several vectors of data, compile them into a temporary table, and do the UPSERT in SQL, but I don't work with large data sets when I'm doing this, so it hasn't yet been worth it to me to write such a procedure. Instead, I prefer to stick with code that is a little easier for me to reason about with my limited SQL skills.
Here, we're just going to push the first 5 rows of mtcars
#* Insert the first 5 rows into the SQL Table
for (i in 1:5)
{
    sqlQuery(channel = channel,
             query = paste0("EXECUTE dbo.sample_procedure ",
                            "@OID = ", sqlNullString(mtcars$OID[i]), ", ",
                            "@mpg = ", mtcars$mpg[i], ", ",
                            "@cyl = ", mtcars$cyl[i], ", ",
                            "@disp = ", mtcars$disp[i], ", ",
                            "@hp = ", mtcars$hp[i]))
}
And now we'll take a look at the table from SQL
sqlQuery(channel = channel,
         query = "EXECUTE dbo.get_mtcars")
This next line is just to match up the OIDs in R and SQL for illustration purposes. Normally, I would do this manually.
mtcars$OID[1:5] <- 1:5
This next for loop will UPSERT all 32 rows. We already have 5, we're UPSERTing 32, and the SQL table at the end should have 32 if we've done it correctly. (That is, SQL will recognize the 5 rows that already exist)
#* Update/Insert (UPSERT) the entire table
for (i in 1:nrow(mtcars))
{
    sqlQuery(channel = channel,
             query = paste0("EXECUTE dbo.sample_procedure ",
                            "@OID = ", sqlNullString(mtcars$OID[i]), ", ",
                            "@mpg = ", mtcars$mpg[i], ", ",
                            "@cyl = ", mtcars$cyl[i], ", ",
                            "@disp = ", mtcars$disp[i], ", ",
                            "@hp = ", mtcars$hp[i]))
}
#* Notice that the first 5 rows were unchanged (though they would have changed
#* if we had changed the data...the point being that the stored procedure
#* correctly identified that these records already existed)
sqlQuery(channel = channel,
         query = "EXECUTE dbo.get_mtcars")
Recap
The stored procedure approach has a major disadvantage in that it is blatantly reinventing the wheel. It also requires that you learn SQL. SQL is pretty easy to learn for simple tasks, but some of the code I've written for more complex tasks is pretty difficult to interpret. Some of my procedures have taken me the better part of a day to get right. (Once they are done, however, they work incredibly well.)
The other big disadvantage to the stored procedure approach is, I've noticed, that it requires a little more code work and organization. I'd say it's probably been about 10% more code work and documentation than if I were just using SQL injection.
The chief advantages of the stored procedure approach are:
- You have massive flexibility in what you can do.
- You can store your SQL code in the database and not pollute your R code with potentially huge strings of SQL.
- Avoiding SQL injection (again, this is a data security thing, and may not be an issue depending on your employer's policies; I'm strictly forbidden from using SQL injection, so stored procedures are my only option).
It should also be noted that I've not yet explored using Table-Valued parameters in my stored procedures, which might simplify things for me a bit.
"In the past, I've gotten around this by running the supremely inefficient sqlQuery with a row-by-row INSERT INTO. But I tried that this time and no data was written, although the sqlQuery statement gave no error or warning message."
I faced this yesterday: in my case the issue was the schema. The table was actually created, but in my user's own schema.
So the first time you can create it, and then you get this error (that the object already exists).
After investigating, I found that some packages do not work correctly with schemas.
In the end I used the "insert by line" solution. The solution is available here and here.
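For what it's worth, if you can use the DBI/odbc stack instead of RODBC, you can name the catalog and schema explicitly, which avoids the table landing in your user's default schema. A minimal sketch (connection string and names are illustrative):

library(DBI)
con <- dbConnect(odbc::odbc(), .connection_string = connStr)
# DBI::Id() spells out catalog.schema.table so the driver quotes it correctly
dbWriteTable(con,
             Id(catalog = "bi_sandbox", schema = "dbo", table = "table1"),
             zinq_scores)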

Multipart queries in SQL Server with RODBC

I am trying to use GO to get R to pull a multipart query from a SQL Server database, but R keeps erroring out when I try. Does anyone know a workaround to get RODBC to run multipart queries?
Example query:
query2 = "IF OBJECT_ID('tempdb..#ATTTempTable') IS NOT NULL
DROP TABLE #ATTTempTable
GO
SELECT
* INTO #ATTTempTable
FROM ETL.ATT.fact_responses fr
WHERE fr.ResponseDateTime > '2015-07-06'
"
channel <- odbcConnect("<host name>", uid="<uid>", pwd="<pwd>")
raw = sqlQuery(channel, query2)
close(channel)
and result
> raw
[1] "42000 102 [Microsoft][ODBC Driver 11 for SQL Server][SQL Server]Incorrect syntax near 'GO'."
[2] "[RODBC] ERROR: Could not SQLExecDirect 'IF OBJECT_ID('tempdb..#ATTTempTable') IS NOT NULL\n DROP TABLE #ATTTempTable\n\nGO\n\nSELECT\n\t* INTO #ATTTempTable\nFROM ETL.ATT.fact_responses fr\nWHERE fr.ResponseDateTime > '2015-07-06'\n'"
>
Because your query contains multiple statements with conditional logic, it resembles a stored procedure.
Simply save that stored procedure in SQL Server (note that GO is a batch separator, not T-SQL, so it cannot appear inside the procedure body):
CREATE PROCEDURE sqlServerSp @ResponseDateTime nvarchar(10)
AS
BEGIN
    -- suppresses affected-rows messages so RODBC returns a dataset
    SET NOCOUNT ON;

    IF OBJECT_ID('tempdb..#ATTTempTable') IS NOT NULL
        DROP TABLE #ATTTempTable;

    -- runs the make-table action query
    SELECT * INTO #ATTTempTable
    FROM ETL.ATT.fact_responses fr
    WHERE fr.ResponseDateTime > @ResponseDateTime;
END
GO
And then run the stored procedure in R. You can even pass parameters like the date:
channel <- odbcConnect("<host name>", uid="<uid>", pwd="<pwd>")
raw = sqlQuery(channel, "EXEC sqlServerSp #ResponseDateTime='2015-07-06'")
close(channel)
You can't; GO is not a T-SQL statement. See https://msdn.microsoft.com/en-us/library/ms188037.aspx
You have to divide your query into two statements and run them separately.
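If you'd rather not create a stored procedure, a minimal sketch of the split-and-run approach (reusing query2 and channel from the question; since GO is only a client-side batch separator, we split on it and send each batch ourselves):

batches <- strsplit(query2, "(?mi)^\\s*GO\\s*$", perl = TRUE)[[1]]
for (b in batches) sqlQuery(channel, b)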

Bind variables in R DBI

In R's DBI package, I'm not finding a facility for using bound variables. I did find a document (the original vignette from 2002) that says about bound variables, "Perhaps the DBI could at some point in the future implement this feature", but it looks like so far that's left undone.
What do people in R use for a substitute? Just concatenate strings right into the SQL? That's got some obvious problems for safety & performance.
EDIT:
Here's an example of how placeholders could work:
query <- "SELECT numlegs FROM animals WHERE color=?"
result <- dbGetQuery(caseinfo, query, bind="green")
That's not a very well-thought-out interface, but the idea is that you can use a value for bind and the driver handles the details of escaping (if the underlying API doesn't handle bound variables natively) without the caller having to reimplement it [badly].
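(For readers arriving later: DBI has since grown native support for this via the params argument, implemented by drivers such as RSQLite and odbc. A minimal sketch:)

library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE animals (numlegs INTEGER, color TEXT)")
dbExecute(con, "INSERT INTO animals VALUES (4, 'green')")
# bound parameter, no string pasting
dbGetQuery(con, "SELECT numlegs FROM animals WHERE color = ?",
           params = list("green"))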
For anyone coming to this question like I just did after googling for rsqlite and dbgetpreparedquery: it seems that in the latest version of RSQLite you can run a SELECT query with bind variables. I just ran the following:
query <- "SELECT probe_type,next_base,color_channel FROM probes WHERE probeid=?"
probe.types.df <- dbGetPreparedQuery(con, query, bind.data=data.frame(probeids=ids))
This was relatively fast (selecting 2,000 rows out of a 450,000-row table) and is incredibly useful.
FYI.
Below is a summary of what's currently supported in RSQLite for bound parameters. You are right that there is not currently support for SELECT, but there is no good reason for this and I would like to add support for it.
If you feel like hacking, you can get a read-only checkout of all of the DBI-related packages here (use --user=readonly --password=readonly):
https://hedgehog.fhcrc.org/compbio/r-dbi/trunk
https://hedgehog.fhcrc.org/compbio/r-dbi/trunk/DBI
https://hedgehog.fhcrc.org/compbio/r-dbi/trunk/SQLite/RSQLite
I like to receive patches, especially if they include tests and documentation. Unified diff, please. I actually do all my development using git, so the best case is to create a git clone of, say, RSQLite and then send me diffs via git format-patch -n git-svn..
Anyhow, here are some examples:
library("RSQLite")
make_data <- function(n)
{
alpha <- c(letters, as.character(0:9))
make_key <- function(n)
{
paste(sample(alpha, n, replace = TRUE), collapse = "")
}
keys <- sapply(sample(1:5, replace=TRUE), function(x) make_key(x))
counts <- sample(seq_len(1e4), n, replace = TRUE)
data.frame(key = keys, count = counts, stringsAsFactors = FALSE)
}
key_counts <- make_data(100)

db <- dbConnect(SQLite(), dbname = ":memory:")
sql <- "create table keys (key text, count integer)"
dbGetQuery(db, sql)

bulk_insert <- function(sql, key_counts)
{
    dbBeginTransaction(db)
    dbGetPreparedQuery(db, sql, bind.data = key_counts)
    dbCommit(db)
    dbGetQuery(db, "select count(*) from keys")[[1]]
}
## for all styles, you can have up to 999 parameters
## anonymous
sql <- "insert into keys values (?, ?)"
bulk_insert(sql, key_counts)
## named w/ :, $, #
## names are matched against column names of bind.data
sql <- "insert into keys values (:key, :count)"
bulk_insert(sql, key_counts[ , 2:1])
sql <- "insert into keys values ($key, $count)"
bulk_insert(sql, key_counts)
sql <- "insert into keys values (#key, #count)"
bulk_insert(sql, key_counts)
## indexed (NOT CURRENTLY SUPPORTED)
## sql <- "insert into keys values (?1, ?2)"
## bulk_insert(sql)
Hey hey - I just discovered that RSQLite, which is what I'm using in this case, does indeed have bound-variable support:
http://cran.r-project.org/web/packages/RSQLite/NEWS
See the entry about dbSendPreparedQuery() and dbGetPreparedQuery().
So in theory, that turns this nastiness:
df <- data.frame()
for (x in data$guid) {
    query <- paste("SELECT uuid, cites, score FROM mytab WHERE uuid='",
                   x, "'", sep="")
    df <- rbind(df, dbGetQuery(con, query))
}
into this:
df <- dbGetPreparedQuery(
con, "SELECT uuid, cites, score FROM mytab WHERE uuid=:guid", data)
Unfortunately, when I actually try it, it seems that it's only for INSERT statements and the like, not for SELECT statements, because I get an error: RS-DBI driver: (cannot have bound parameters on a SELECT statement).
Providing that capability would be fantastic.
The next step would be to hoist this up into DBI itself so that all DBs can take advantage of it, and provide a default implementation that just pastes it into the string like we're all doing ourselves now.
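For illustration, a default implementation along those lines might look like this (a hypothetical helper, not part of DBI: it escapes each value and splices it into the ? placeholders, i.e. exactly the pasting we currently do by hand):

naiveBind <- function(sql, values) {
    for (v in values) {
        if (is.character(v)) {
            v <- paste0("'", gsub("'", "''", v, fixed = TRUE), "'")  # escape embedded quotes
        } else {
            v <- as.character(v)
        }
        sql <- sub("?", v, sql, fixed = TRUE)  # fill the next placeholder
    }
    sql
}
naiveBind("SELECT numlegs FROM animals WHERE color=?", list("green"))
# [1] "SELECT numlegs FROM animals WHERE color='green'"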