How to pass data.frame for UPDATE with R DBI - database

With RODBC, there were functions like sqlUpdate(channel, dat, ...) that allowed you pass dat = data.frame(...) instead of having to construct your own SQL string.
However, with R's DBI, all I see are functions like dbSendQuery(conn, statement, ...) which only take a string statement and gives no opportunity to specify a data.frame directly.
So how to UPDATE using a data.frame with DBI?

Really late, my answer, but maybe still helpful...
There is no single function (I know) in the DBI/odbc package but you can replicate the update behavior using a prepared update statement (which should work faster than RODBC's sqlUpdate since it sends the parameter values as a batch to the SQL server:
library(DBI)
library(odbc)
con <- dbConnect(odbc::odbc(), driver="{SQL Server Native Client 11.0}", server="dbserver.domain.com\\default,1234", Trusted_Connection = "yes", database = "test") # assumes Microsoft SQL Server
dbWriteTable(con, "iris", iris, row.names = TRUE) # create and populate a table (adding the row names as a separate columns used as row ID)
update <- dbSendQuery(con, 'update iris set "Sepal.Length"=?, "Sepal.Width"=?, "Petal.Length"=?, "Petal.Width"=?, "Species"=? WHERE row_names=?')
# create a modified version of `iris`
iris2 <- iris
iris2$Sepal.Length <- 5
iris2$Petal.Width[2] <- 1
iris2$row_names <- rownames(iris) # use the row names as unique row ID
dbBind(update, iris2) # send the updated data
dbClearResult(update) # release the prepared statement
# now read the modified data - you will see the updates did work
data1 <- dbReadTable(con, "iris")
dbDisconnect(con)
This works only if you have a primary key which I created in the above example by using the row names which are a unique number increased by one for each row...
For more information about the odbc package I have used in the DBI dbConnect statement see: https://github.com/rstats-db/odbc

Building on R Yoda's answer, I made myself the helper function below. This allows using a dataframe to specify update conditions.
While I built this to run transaction updates (i.e. single rows), it can in theory update multiple rows passing a condition. However, that's not the same as updating multiple rows using an input dataframe. Maybe somebody else can build on this...
dbUpdateCustom = function(x, key_cols, con, schema_name, table_name) {
if (nrow(x) != 1) stop("Input dataframe must be exactly 1 row")
if (!all(key_cols %in% colnames(x))) stop("All columns specified in 'key_cols' must be present in 'x'")
# Build the update string --------------------------------------------------
df_key <- dplyr::select(x, one_of(key_cols))
df_upt <- dplyr::select(x, -one_of(key_cols))
set_str <- purrr::map_chr(colnames(df_upt), ~glue::glue_sql('{`.x`} = {x[[.x]]}', .con = con))
set_str <- paste(set_str, collapse = ", ")
where_str <- purrr::map_chr(colnames(df_key), ~glue::glue_sql("{`.x`} = {x[[.x]]}", .con = con))
where_str <- paste(where_str, collapse = " AND ")
update_str <- glue::glue('UPDATE {schema_name}.{table_name} SET {set_str} WHERE {where_str}')
# Execute ------------------------------------------------------------------
query_res <- DBI::dbSendQuery(con, update_str)
DBI::dbClearResult(query_res)
return (invisible(TRUE))
}
Where
x: 1-row dataframe that contains 1+ key columns, and 1+ update columns.
key_cols: character vector, of 1 or more column names that are the keys (i.e. used in the WHERE clause)

Here is a little helper function I put together using REPLACE INTO to update a table using DBI, replacing old duplicate entries with the new values. It's basic and for my own needs, but should be easy to modify. All you need to pass to the function is the connection, table name, and dataframe. Note that the table must have a PRIMARY KEY column. I've also included a simple working example.
row_to_list <- function(Y) suppressWarnings(split(Y, f = row(Y)))
sql_val <- function(y){
if(!is.numeric(y)){
return(paste0("'",y,"'"))
}else{
if(is.na(y)){
return("NULL")
}else{
return(as.character(y))
}
}
}
to_sql_row <- function(x) paste0("(",paste(do.call("c", lapply(x, FUN = sql_val)), collapse = ", "),")")
bracket <- function(x) paste0("`",x,"`")
to_sql_string <- function(x) paste0("(",paste(sapply(x, FUN = bracket), collapse = ", "),")")
replace_into_table <- function(con, table_name, new_data){
#new_data <- data.table(new_data)
cols <- to_sql_string(names(new_data))
vals <- paste(lapply(row_to_list(new_data), FUN = to_sql_row), collapse = ", ")
query <- paste("REPLACE INTO", table_name, cols, "VALUES", vals)
rs <- dbExecute(con, query)
return(rs)
}
tb <- data.frame("id" = letters[1:20], "A" = 1:20, "B" = seq(.1,2,.1)) # sample data
dbWriteTable(con, "test_table", tb) # create table
dbExecute(con, "ALTER TABLE test_table ADD PRIMARY KEY (id)") # set primary key
new_data <- data.frame("id" = letters[19:23], "A" = 1:5, "B" = seq(101,105)) # new data
new_data[4,2] <- NA # add some NA values
new_data[5,3] <- NA
table_name <- "test_table"
replace_into_table(con, "test_table", new_data)
result <- dbReadTable(con, "test_table")

Related

Writing Unicode from R to SQL Server

I'm trying to write Unicode strings from R to SQL, and then use that SQL table to power a Power BI dashboard. Unfortunately, the Unicode characters only seem to work when I load the table back into R, and not when I view the table in SSMS or Power BI.
require(odbc)
require(DBI)
require(dplyr)
con <- DBI::dbConnect(odbc::odbc(),
.connection_string = "DRIVER={ODBC Driver 13 for SQL Server};SERVER=R9-0KY02L01\\SQLEXPRESS;Database=Test;trusted_connection=yes;")
testData <- data_frame(Characters = "❤")
dbWriteTable(con,"TestUnicode",testData,overwrite=TRUE)
result <- dbReadTable(con, "TestUnicode")
result$Characters
Successfully yields:
> result$Characters
[1] "❤"
However, when I pull that table in SSMS:
SELECT * FROM TestUnicode
I get two different characters:
Characters
~~~~~~~~~~
â¤
Those characters are also what appear in Power BI. How do I correctly pull the heart character outside of R?
It turns out this is a bug somewhere in R/DBI/the ODBC driver. The issue is that R stores strings as UTF-8 encoded, while SQL Server stores them as UTF-16LE encoded. Also, when dbWriteTable creates a table, it by default creates a VARCHAR column for strings which can't even hold Unicode characters. Thus, you need to both:
Change the column in the R data frame from being a string column to a list column of UTF-16LE raw bytes.
When using dbWriteTable, specify the field type as being NVARCHAR(MAX)
This seems like something that should still be handled by either DBI or ODBC or something though.
require(odbc)
require(DBI)
# This function takes a string vector and turns it into a list of raw UTF-16LE bytes.
# These will be needed to load into SQL Server
convertToUTF16 <- function(s){
lapply(s, function(x) unlist(iconv(x,from="UTF-8",to="UTF-16LE",toRaw=TRUE)))
}
# create a connection to a sql table
connectionString <- "[YOUR CONNECTION STRING]"
con <- DBI::dbConnect(odbc::odbc(),
.connection_string = connectionString)
# our example data
testData <- data.frame(ID = c(1,2,3), Char = c("I", "❤","Apples"), stringsAsFactors=FALSE)
# we adjust the column with the UTF-8 strings to instead be a list column of UTF-16LE bytes
testData$Char <- convertToUTF16(testData$Char)
# write the table to the database, specifying the field type
dbWriteTable(con,
"UnicodeExample",
testData,
append=TRUE,
field.types = c(Char = "NVARCHAR(MAX)"))
dbDisconnect(con)
Inspired by last answer and github: r-dbi/DBI#215: Storing unicode characters in SQL Server
Following field.types = c(Char = "NVARCHAR(MAX)") but with vector and compute of max because of the error dbReadTable/dbGetQuery returns Invalid Descriptor Index .... :
vector_nvarchar<-c(Filter(Negate(is.null),
(
lapply(testData,function(x){
if (is.character(x) ) c(
names(x),
paste0("NVARCHAR(",
max(
# nvarchar(max) gave error dbReadTable/dbGetQuery returns Invalid Descriptor Index error on SQL server
# https://github.com/r-dbi/odbc/issues/112
# so we compute the max
nchar(
iconv( #nchar doesn't work for UTF-8 : help (nchar)
Filter(Negate(is.null),x)
,"UTF-8","ASCII",sub ="x"
)
)
,na.rm = TRUE)
,")"
)
)
})
)
))
con= DBI::dbConnect(odbc::odbc(),.connection_string=xxxxt, encoding = 'UTF-8')
DBI::dbWriteTable(con,"UnicodeExample",testData, overwrite= TRUE, append=FALSE, field.types= vector_nvarchar)
DBI::dbGetQuery(con,iconv('select * from UnicodeExample'))
Inspired by the last answer I also tried to find an automated way for writing data frames to SQL server. I can not confirm the nvarchar(max) errors, so I ended up with these functions:
convertToUTF16_df <- function(df){
output <- cbind(df[sapply(df, typeof) != "character"]
, list.cbind(apply(df[sapply(df, typeof) == "character"], 2, function(x){
return(lapply(x, function(y) unlist(iconv(y, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))))
}))
)[colnames(df)]
return(output)
}
field_types <- function(df){
output <- list()
output[colnames(df)[sapply(df, typeof) == "character"]] <- "nvarchar(max)"
return(output)
}
DBI::dbWriteTable(odbc_connect
, name = SQL("database.schema.table")
, value = convertToUTF16_df(df)
, overwrite = TRUE
, row.names = FALSE
, field.types = field_types(df)
)
I found the previous answer very useful but ran into problems with character vectors that had another encoding such as 'latin1' instead of UTF-8. This resulted in random NULLs in the database column due to special characters such as non-breaking spaces.
In order to avoid these encoding issues, I've made the following modifications to detect the character vector encoding or otherwise default back to UTF-8 before conversion to UTF-16LE:
library(rlist)
convertToUTF16_df <- function(df){
output <- cbind(df[sapply(df, typeof) != "character"]
, list.cbind(apply(df[sapply(df, typeof) == "character"], 2, function(x){
return(lapply(x, function(y) {
if (Encoding(y)=="unknown") {
unlist(iconv(enc2utf8(y), from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))
} else {
unlist(iconv(y, from = Encoding(y), to = "UTF-16LE", toRaw = TRUE))
}
}))
}))
)[colnames(df)]
return(output)
}
field_types <- function(df){
output <- list()
output[colnames(df)[sapply(df, typeof) == "character"]] <- "nvarchar(max)"
return(output)
}
DBI::dbWriteTable(odbc_connect
, name = SQL("database.schema.table")
, value = convertToUTF16_df(df)
, overwrite = TRUE
, row.names = FALSE
, field.types = field_types(df)
)
Ideally, I'd still modify this to remove the rlist dependency but it seems to work now.
You could consider using the package RODBC instead of odbc/DBI. I've have used RODBC with SQL Server and with Microsoft Access as permanent data storage system. I never had trouble with german umlaut (e.g. Ä, ä, ..., ß)
I wonder if using iconv is an appealing alternative as there seem to boe some '\X00' issues (e.g. https://www.r-bloggers.com/2010/06/more-powerful-iconv-in-r/)
I am posting this answer as an Extension to the top answer, because some people might find it useful.
If you need Unicode strings in SQL statements such as INSERT or UPDATE where you cannot use dbWriteTable(), you can constructing your query with dbBind() like this:
x <- "äöü"
x <- iconv(x, from="UTF-8", to="UTF-16LE", toRaw = TRUE)
q <-
"
update foobar
set umlauts = ?
where id = 1
")
query <- DBI::dbSendStatement(con, q)
DBI::dbBind(query, list(x))
DBI::dbClearResult(query)

Creating Tables Iteratively with different columns

I am trying to create a series of tables by group and I would like to have each table run iteratively with a different variable. I would also like to add the p-value from an anova to the bottom of the table. I am able to do a single table easily using the aov function and the kable function. I thought a for loop might work as follows:
#list of column names
varlist <- c("var1", "var2", "var3", "var4", "var5", "var6")
for (var in 1:6){
#anova
aov(varlist[[var]] ~ group, data=dc3)
#pull out pvalue for anova as string
pval <-paste("ANOVA P-Value:", round(summary(fit)[[1]][["Pr(>F)"]][[1]], 3))
# Create Table
Table <- dc3 %>% group_by(group) %>%
summarise(Mean = round(mean(varlist[[var]], na.rm = TRUE),2), SD = round(sd(varlist[[var]], na.rm = TRUE),2))
# Add Pvalue to bottom on table
kable(worst_arr_delays, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
add_footnote(c(pval2), notation = "symbol")
}
Ideally, this would give me six tables that would look similar to this:
If I am understanding what you want to do, I think that this code could serve you.
library(dplyr)
library(knitr)
library(kableExtra)
dc3 = data.frame(var1=rnorm(40,25,5),var2=rnorm(40,25,5),var3=rnorm(40,25,5),
var4=rnorm(40,25,5),var5=rnorm(40,25,5),var6=rnorm(40,25,5),
group=rep(c("gr.2","gr.3","gr.4","veh"),each=10))
res = NULL
for(i in 1:6){
fit <- aov(dc3[,i]~group,dc3)
pval <-paste("ANOVA P-Value:", round(summary(fit)[[1]][["Pr(>F)"]][[1]], 3))
# Create Table
Table <- dc3 %>% group_by(group) %>%
summarise(Mean = round(mean(dc3[,i],na.rm = TRUE),2), SD = round(sd(dc3[,i], na.rm = TRUE),2))
# Add Pvalue to bottom on table
res <- kable(Table,"html") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
add_footnote(c(pval), notation = "symbol")
print(res)
}

Oracle to SQL Server large-scale Migration

I am trying to migrate a very large scale of rows and columns dynamically from Oracle to SQL Server. I am trying to automate this for about 20 tables daily.
My process is as follows:
Run Oracle configuration script: runs query which returns the Oracle table data, but the data has duplicates and bad data
Oracle Format Script: calls config script and then fixes the data to be accurate (deletes duplicates and bad data). Returns the data to a .txt with '~' separators
.txt to .csv: (this is my workaround/hack) Using Excel, I open all of the .txt files and use the AutoFormat to change them into perfectly formatted .csv's. (By the manual delimiter option, because when I use the auto-delimiter='~', it comes up with random one letter columns)
R: using the newly generated .csv's, I create table structures based on these headers in SQL Server
R: using the data collected from running the format script for each table, I have ~20 .txt files. I load them into R Studio and then attempt to insert new data using the following script:
Initialization:
RScript which loads the table structure of one .CSV
library(RODBC)
#set table name
tname <- "tablename"
#connect to MSSQL
conn <- odbcDriverConnect("driver={SQL Server};
Server=serv; Database=dbname;
Uid=username; Pwd=pwd;trusted_connection=yes")
#get df headers from .csv
updatedtables <- read.csv(file = paste("<PATH>", tname, ".csv", sep=""), sep = ",")
#save to MSSQL
save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = F, rownames = F, verbose = T, safer = T, fast = F)
#add update record in log file
write(paste(Sys.time(), ": Updated dbo.",tname, sep=""), file = "<PATH>", append = TRUE)
odbcClose(conn)
return(0)
I run this code for each .csv to initialize the table structures
Dynamic things:
RScript which attempts to load data from .txt to existing table structures:
library(RODBC)
##~~~~~~~~~~~~~~ Oracle to MSSQL Automation Script ~~~~~~~~~~
#.bat script first runs which updates .txt dumps containing Oracle database info using configuration and format .sql queries for each table.
#function that writes to a table tname using data from tname.txt
writeTable <- function (tname){
#connect to MSSQL
conn <- odbcDriverConnect("driver={SQL Server};
Server=serv; Database=db;
Uid=uid; Pwd=pw;trusted_connection=yes")
#remove all data entries for table
res <- sqlQuery(conn, paste("TRUNCATE TABLE dbo.", tname, sep = ""))
#load updated data from .txt files
#this skips three because of blank lines and skipping the headers
updatedtables <- read.csv(file = paste("<PATH>", tname, ".txt", sep=""),skip=3, sep = "~")
**ERROR**
#save to MSSQL
save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = T, rownames = F, verbose = F, safer = T, fast = F)
#add update record in log file
write(paste(Sys.time(), ": Updated dbo.",tname, sep=""), file = "<PATH>", append = TRUE)
#close connection
odbcClose(conn)
return(0)
}
#gets all file names and dynamically inputs the data for every file in the path directory
update <- function(){
#insert line separator for log organization which represents new script run
write("-UPDATE--------------------", file = "<PATH>", append = T)
path = "<PATH>"
#save all .txt file names in path to vector
file.names <- dir(path, pattern =".txt")
#go through path's files
for(i in 1:length(file.names)){
#get proper table name of file.names[i]
temp <- gsub(".txt", "", file.names[i])
#call helper func
writeTable(temp)
}
}
#run
update()
This file contains the error. Everything works and loads, except the line save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = T, rownames = F, verbose = F, safer = T, fast = F), which returns the error
Error in `colnames<-`(`*tmp*`, value = c("QUOTENUMBER", "ENTITYPRODUCTITEMID", :
length of 'dimnames' [2] not equal to array extent
Any suggestions on what to do?
This is an example of the way the text dumps look (without spaces and with dummy data)
http://pastebin.com/kF8yR1R0

T-SQL Temp Tables and Stored Procedures from R using RevoScaleR

In the example below, I was able to get the query to work with one exception. When I use q in place of source.query during the RxSqlServerData step, I get the error rxCompleteClusterJob Execution halted.
The first goal is to use a stored procedure in place of a longer query. Is this possible?
The second goal would be to create and call upon a #TEMPORARY table within the stored procedure. I'm wondering if that is possible, as well?
library (RODBC)
library (RevoScaleR)
sqlConnString <- "Driver=SQL Server;Server=SAMPLE_SERVER; Database=SAMPLE_DATABASE;Trusted_Connection=True"
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
sql_share_directory <- paste("D:\\RWork\\AllShare\\", Sys.getenv("USERNAME"), sep = "")
sqlCompute <- RxInSqlServer(connectionString = sqlConnString, wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(sqlCompute)
#This Sample Query Works
source.query <- paste("SELECT CASE WHEN [Order Date Key] = [Picked Date Key]",
"THEN 1 ELSE 0 END AS SameDayFulfillment,",
"[City Key] AS city, [STOCK ITEM KEY] AS item,",
"[PICKER KEY] AS picker, [QUANTITY] AS quantity",
"FROM [WideWorldImportersDW].[FACT].[ORDER]",
"WHERE [WWI ORDER ID] >= 63968")
#This Query Does Not
q <- paste("EXEC [dbo].[SAMPLE_STORED_PROCEDURE]")
inDataSource <- RxSqlServerData(sqlQuery=q, connectionString=sqlConnString, rowsPerRead=500)
order.logit.rx <- rxLogit(SameDayFulfillment ~ city + item + picker + quantity, data = inDataSource)
order.logit.rx
Currently, only T-SQL SELECT statements are allowed as input data-set, not stored procedures.

Primary keys with Apache Spark

I am having a JDBC connection with Apache Spark and PostgreSQL and I want to insert some data into my database. When I use append mode I need to specify id for each DataFrame.Row. Is there any way for Spark to create primary keys?
Scala:
If all you need is unique numbers you can use zipWithUniqueId and recreate DataFrame. First some imports and dummy data:
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType}
val df = sc.parallelize(Seq(
("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar")
Extract schema for further usage:
val schema = df.schema
Add id field:
val rows = df.rdd.zipWithUniqueId.map{
case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}
Create DataFrame:
val dfWithPK = sqlContext.createDataFrame(
rows, StructType(StructField("id", LongType, false) +: schema.fields))
The same thing in Python:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, LongType
row = Row("foo", "bar")
row_with_index = Row(*["id"] + df.columns)
df = sc.parallelize([row("a", -1.0), row("b", -2.0), row("c", -3.0)]).toDF()
def make_row(columns):
def _make_row(row, uid):
row_dict = row.asDict()
return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
return _make_row
f = make_row(df.columns)
df_with_pk = (df.rdd
.zipWithUniqueId()
.map(lambda x: f(*x))
.toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))
If you prefer consecutive number your can replace zipWithUniqueId with zipWithIndex but it is a little bit more expensive.
Directly with DataFrame API:
(universal Scala, Python, Java, R with pretty much the same syntax)
Previously I've missed monotonicallyIncreasingId function which should work just fine as long as you don't require consecutive numbers:
import org.apache.spark.sql.functions.monotonicallyIncreasingId
df.withColumn("id", monotonicallyIncreasingId).show()
// +---+----+-----------+
// |foo| bar| id|
// +---+----+-----------+
// | a|-1.0|17179869184|
// | b|-2.0|42949672960|
// | c|-3.0|60129542144|
// +---+----+-----------+
While useful monotonicallyIncreasingId is non-deterministic. Not only ids may be different from execution to execution but without additional tricks cannot be used to identify rows when subsequent operations contain filters.
Note:
It is also possible to use rowNumber window function:
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber
w = Window().orderBy()
df.withColumn("id", rowNumber().over(w)).show()
Unfortunately:
WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
So unless you have a natural way to partition your data and ensure uniqueness is not particularly useful at this moment.
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("id", monotonically_increasing_id()).show()
Note that the 2nd argument of df.withColumn is monotonically_increasing_id() not monotonically_increasing_id .
I found the following solution to be relatively straightforward for the case where zipWithIndex() is the desired behavior, i.e. for those desirng consecutive integers.
In this case, we're using pyspark and relying on dictionary comprehension to map the original row object to a new dictionary which fits a new schema including the unique index.
# read the initial dataframe without index
dfNoIndex = sqlContext.read.parquet(dataframePath)
# Need to zip together with a unique integer
# First create a new schema with uuid field appended
newSchema = StructType([StructField("uuid", IntegerType(), False)]
+ dfNoIndex.schema.fields)
# zip with the index, map it to a dictionary which includes new field
df = dfNoIndex.rdd.zipWithIndex()\
.map(lambda (row, id): {k:v
for k, v
in row.asDict().items() + [("uuid", id)]})\
.toDF(newSchema)
For anyone else who doesn't require integer types, concatenating the values of several columns whose combinations are unique across the data can be a simple alternative. You have to handle nulls since concat/concat_ws won't do that for you. You can also hash the output if the concatenated values are long:
import pyspark.sql.functions as sf
unique_id_sub_cols = ["a", "b", "c"]
df = df.withColumn(
"UniqueId",
sf.md5(
sf.concat_ws(
"-",
*[
sf.when(sf.col(sub_col).isNull(), sf.lit("Missing")).otherwise(
sf.col(sub_col)
)
for sub_col in unique_id_sub_cols
]
)
),
)

Resources