Writing Unicode from R to SQL Server - sql-server

I'm trying to write Unicode strings from R to SQL, and then use that SQL table to power a Power BI dashboard. Unfortunately, the Unicode characters only seem to work when I load the table back into R, and not when I view the table in SSMS or Power BI.
require(odbc)
require(DBI)
require(dplyr)
con <- DBI::dbConnect(odbc::odbc(),
                      .connection_string = "DRIVER={ODBC Driver 13 for SQL Server};SERVER=R9-0KY02L01\\SQLEXPRESS;Database=Test;trusted_connection=yes;")
testData <- data_frame(Characters = "❤")
dbWriteTable(con,"TestUnicode",testData,overwrite=TRUE)
result <- dbReadTable(con, "TestUnicode")
result$Characters
Successfully yields:
> result$Characters
[1] "❤"
However, when I pull that table in SSMS:
SELECT * FROM TestUnicode
I get two different characters:
Characters
~~~~~~~~~~
â¤
Those characters are also what appear in Power BI. How do I correctly pull the heart character outside of R?

It turns out this is a bug somewhere in R/DBI/the ODBC driver. The issue is that R stores strings as UTF-8, while SQL Server stores them as UTF-16LE. On top of that, when dbWriteTable creates a table, it by default creates a VARCHAR column for strings, which cannot hold Unicode characters at all. You therefore need to do both of the following:
Change the column in the R data frame from a string column to a list column of UTF-16LE raw bytes.
When calling dbWriteTable, specify the field type as NVARCHAR(MAX).
This seems like something that should really be handled by DBI or odbc, though.
require(odbc)
require(DBI)

# This function takes a string vector and turns it into a list of raw UTF-16LE bytes.
# These will be needed to load into SQL Server.
convertToUTF16 <- function(s){
  lapply(s, function(x) unlist(iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)))
}

# create a connection to a SQL table
connectionString <- "[YOUR CONNECTION STRING]"
con <- DBI::dbConnect(odbc::odbc(),
                      .connection_string = connectionString)

# our example data
testData <- data.frame(ID = c(1, 2, 3), Char = c("I", "❤", "Apples"), stringsAsFactors = FALSE)

# we adjust the column with the UTF-8 strings to instead be a list column of UTF-16LE bytes
testData$Char <- convertToUTF16(testData$Char)

# write the table to the database, specifying the field type
dbWriteTable(con,
             "UnicodeExample",
             testData,
             append = TRUE,
             field.types = c(Char = "NVARCHAR(MAX)"))

dbDisconnect(con)
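If you do this often, the two steps can be rolled into a small convenience helper. This is only a sketch building on convertToUTF16() above; writeUnicodeTable is a made-up name, not part of DBI or odbc:
# convert the given character columns and declare them as NVARCHAR(MAX)
writeUnicodeTable <- function(con, name, df, char_cols) {
  for (col in char_cols) df[[col]] <- convertToUTF16(df[[col]])  # list columns of UTF-16LE raw bytes
  types <- stats::setNames(rep("NVARCHAR(MAX)", length(char_cols)), char_cols)
  dbWriteTable(con, name, df, append = TRUE, field.types = types)
}
# e.g. writeUnicodeTable(con, "UnicodeExample", testData, char_cols = "Char")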

Inspired by the last answer and by GitHub issue r-dbi/DBI#215: Storing unicode characters in SQL Server.
This follows the field.types = c(Char = "NVARCHAR(MAX)") approach, but builds the field types as a named vector and computes the maximum length per column, because NVARCHAR(MAX) triggered the "dbReadTable/dbGetQuery returns Invalid Descriptor Index" error:
vector_nvarchar <- c(Filter(Negate(is.null),
  lapply(testData, function(x) {
    if (is.character(x)) c(
      names(x),
      paste0("NVARCHAR(",
             max(
               # NVARCHAR(MAX) gave the "dbReadTable/dbGetQuery returns Invalid Descriptor Index"
               # error on SQL Server (https://github.com/r-dbi/odbc/issues/112),
               # so we compute the max length instead.
               # nchar() doesn't count UTF-8 characters reliably (see ?nchar),
               # hence the iconv() to ASCII with substitution.
               nchar(
                 iconv(Filter(Negate(is.null), x), "UTF-8", "ASCII", sub = "x")
               ),
               na.rm = TRUE),
             ")")
    )
  })
))
con <- DBI::dbConnect(odbc::odbc(), .connection_string = "[YOUR CONNECTION STRING]", encoding = 'UTF-8')
DBI::dbWriteTable(con, "UnicodeExample", testData, overwrite = TRUE, append = FALSE, field.types = vector_nvarchar)
DBI::dbGetQuery(con, 'select * from UnicodeExample')
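For what it's worth, the "Invalid Descriptor Index" error is, as far as I understand the linked odbc issue, caused by SQL Server requiring long columns (such as NVARCHAR(MAX)) to come last in the result set. If that is the cause, keeping NVARCHAR(MAX) and simply selecting the wide columns last may also work; a sketch assuming the UnicodeExample table from above:
# select the NVARCHAR(MAX) column last so the SQL Server ODBC driver can fetch it
DBI::dbGetQuery(con, "select ID, Char from UnicodeExample")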

Inspired by the last answer, I also tried to find an automated way to write data frames to SQL Server. I cannot confirm the nvarchar(max) errors, so I ended up with these functions:
library(rlist)  # provides list.cbind()

convertToUTF16_df <- function(df){
  output <- cbind(df[sapply(df, typeof) != "character"],
                  list.cbind(apply(df[sapply(df, typeof) == "character"], 2, function(x){
                    return(lapply(x, function(y) unlist(iconv(y, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))))
                  })))[colnames(df)]
  return(output)
}

field_types <- function(df){
  output <- list()
  output[colnames(df)[sapply(df, typeof) == "character"]] <- "nvarchar(max)"
  return(output)
}
DBI::dbWriteTable(odbc_connect,
                  name = SQL("database.schema.table"),
                  value = convertToUTF16_df(df),
                  overwrite = TRUE,
                  row.names = FALSE,
                  field.types = field_types(df))
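For illustration, here is what field_types() returns for a small sample data frame (the df below is made up; the output shown in the comment is what I would expect, not captured from a live session):
df <- data.frame(ID = 1:2, Comment = c("❤", "Äpfel"), stringsAsFactors = FALSE)
str(field_types(df))
# expected:
# List of 1
#  $ Comment: chr "nvarchar(max)"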

I found the previous answer very useful but ran into problems with character vectors that had another encoding such as 'latin1' instead of UTF-8. This resulted in random NULLs in the database column due to special characters such as non-breaking spaces.
In order to avoid these encoding issues, I've made the following modifications to detect the character vector encoding or otherwise default back to UTF-8 before conversion to UTF-16LE:
library(rlist)

convertToUTF16_df <- function(df){
  output <- cbind(df[sapply(df, typeof) != "character"],
                  list.cbind(apply(df[sapply(df, typeof) == "character"], 2, function(x){
                    return(lapply(x, function(y) {
                      if (Encoding(y) == "unknown") {
                        unlist(iconv(enc2utf8(y), from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))
                      } else {
                        unlist(iconv(y, from = Encoding(y), to = "UTF-16LE", toRaw = TRUE))
                      }
                    }))
                  })))[colnames(df)]
  return(output)
}

field_types <- function(df){
  output <- list()
  output[colnames(df)[sapply(df, typeof) == "character"]] <- "nvarchar(max)"
  return(output)
}

DBI::dbWriteTable(odbc_connect,
                  name = SQL("database.schema.table"),
                  value = convertToUTF16_df(df),
                  overwrite = TRUE,
                  row.names = FALSE,
                  field.types = field_types(df))
Ideally I'd still modify this to remove the rlist dependency, but it seems to work for now.
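For reference, a base-R variant without rlist could look roughly like this (an untested sketch; I'm assuming plain list columns built this way are handed to dbWriteTable just like the list.cbind() result, and convertToUTF16_df_base is just a name I made up):
convertToUTF16_df_base <- function(df){
  char_cols <- names(df)[sapply(df, typeof) == "character"]
  for (col in char_cols) {
    df[[col]] <- lapply(df[[col]], function(y) {
      enc <- Encoding(y)
      if (enc == "unknown") { y <- enc2utf8(y); enc <- "UTF-8" }
      # convert each string to a raw vector of UTF-16LE bytes
      unlist(iconv(y, from = enc, to = "UTF-16LE", toRaw = TRUE))
    })
  }
  df
}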

You could consider using the RODBC package instead of odbc/DBI. I have used RODBC with SQL Server and with Microsoft Access as permanent data storage systems and never had trouble with German umlauts (e.g. Ä, ä, ..., ß).
I wonder whether iconv is an appealing alternative, as there seem to be some '\x00' issues (e.g. https://www.r-bloggers.com/2010/06/more-powerful-iconv-in-r/).
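For completeness, a minimal RODBC sketch of the same round trip (untested; the server name is a placeholder, and varTypes is used to force an NVARCHAR column so the Unicode characters have somewhere to go):
library(RODBC)
ch <- odbcDriverConnect("DRIVER={ODBC Driver 13 for SQL Server};SERVER=[YOUR SERVER];Database=Test;trusted_connection=yes;")
testData <- data.frame(Characters = "❤", stringsAsFactors = FALSE)
sqlSave(ch, testData, tablename = "TestUnicodeRODBC", rownames = FALSE,
        varTypes = c(Characters = "nvarchar(max)"))
sqlQuery(ch, "SELECT * FROM TestUnicodeRODBC")
odbcClose(ch)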

I am posting this answer as an extension to the top answer, because some people might find it useful.
If you need Unicode strings in SQL statements such as INSERT or UPDATE where you cannot use dbWriteTable(), you can construct your query with dbBind() like this:
x <- "äöü"
x <- iconv(x, from="UTF-8", to="UTF-16LE", toRaw = TRUE)
q <-
"
update foobar
set umlauts = ?
where id = 1
")
query <- DBI::dbSendStatement(con, q)
DBI::dbBind(query, list(x))
DBI::dbClearResult(query)
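The same pattern extends to parameterised INSERTs with several rows: bind one list of UTF-16LE raw vectors per string parameter. A sketch reusing the hypothetical foobar table from above (assuming it has id and umlauts columns):
vals <- iconv(c("äöü", "❤"), from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)  # a list of raw vectors
ids <- c(1L, 2L)
ins <- DBI::dbSendStatement(con, "insert into foobar (id, umlauts) values (?, ?)")
DBI::dbBind(ins, list(ids, vals))
DBI::dbClearResult(ins)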

Related

Snowflake JDBC parameter returning VARCHAR for all datatypes

The Snowflake JDBC driver is reporting parameter metadata for all datatypes as VARCHAR. Is there any way to overcome this problem?
DDL:
CREATE TABLE INTTABLE(INTCOL INTEGER)
Below is the output from the Snowflake ODBC driver:
SQLPrepare:
In:StatementHandle = 0x00000000021B1B50, StatementText = "INSERT INTO INTTABLE(INTCOL) VALUES(?)", TextLength = 42
Return: SQL_SUCCESS=0
SQLDescribeParam:
In:StatementHandle = 0x00000000021B1B50, ParameterNumber = 1, DataTypePtr = 0x00000000001294D0, ParameterSizePtr = 0x0000000000126950,DecimalDigits =0x0000000000126980, NullablePtr = 0x00000000001269B0
Return: SQL_SUCCESS=0
Out:*DataTypePtr = SQL_VARCHAR=12, *ParameterSizePtr = 16777216, *DecimalDigits = 0, *NullablePtr = SQL_NULLABLE=1
Below is the output with the Snowflake JDBC driver:
PreparedStatement ps = c.prepareStatement("INSERT INTO INTTABLE(INTCOL) VALUES(?)");
ParameterMetaData psmd = ps.getParameterMetaData();
for (int i = 1; i <= psmd.getParameterCount(); i++) {
    System.out.println(psmd.getParameterType(i) + " " + psmd.getParameterTypeName(i));
}
Output:
12 text
Thank you for adding more information to your thread. I may still be doing a little guesswork, though.
If you are trying to change the column type from VARCHAR and there are no values in the table, you can drop the table and then re-create it.
If you want to ALTER what is already in the table, try altering the table first: Manual Reference
There is also CREATE OR REPLACE TABLE (col1 <type>, col2 <type>, ...), which takes care of both.
Is this what you are looking for?

Outputting SQL Data to a Text File with Python - How to get rid of None?

I’m pulling data from a SQL Server table using pyodbc python code.
In the output file I’m getting records like this:
1, 1, None, None, None, None, None, None
The None values are Null in the SQL table.
I’d like to see records in the text file in this format. I do not want to see the None.
1, 1, , , , , ,
Any ideas how I can do this?
Here is the code I'm using:
import pyodbc

outputfile = 'MyOut.txt'
output_data = open(outputfile, 'w+')

conn = pyodbc.connect(
    r'Driver={SQL Server};'
    r'Server=MyServer;'
    r'Database=MyData;'
    r'Trusted_Connection=yes;')

crsr = conn.cursor()
crsr.execute('select * from MyTable')

for row in crsr:
    print(str(row))
    outrows = str(row).strip('(')
    outrows = outrows.strip(')')
    output_data.write(outrows + '\n')

output_data.close()
I understand that outrows is a string, but this would probably be easier with a list. Aside from that, the output is probably meant to be a string, since you're writing into a .txt file.
You could modify your for loop like this:
for row in crsr:
    outrows = str(row).strip("(").strip(")")
    # create a list by splitting the string at each comma
    line = outrows.split(",")
    for component in line:
        if component == " None":
            # compare against " None" (with a space), since there is most likely a space after the ","
            line[line.index(component)] = ""
    output_data.write(",".join(line) + "\n")
I'm afraid I'm not particularly familiar with pyodbc, but I hope this was of help.

Oracle to SQL Server large-scale Migration

I am trying to migrate a very large number of rows and columns dynamically from Oracle to SQL Server, and I am trying to automate this for about 20 tables daily.
My process is as follows:
Run Oracle configuration script: runs a query which returns the Oracle table data, but the data has duplicates and bad data.
Oracle format script: calls the configuration script and then fixes the data to be accurate (deletes duplicates and bad data). Returns the data to a .txt with '~' separators.
.txt to .csv (this is my workaround/hack): using Excel, I open all of the .txt files and use AutoFormat to change them into perfectly formatted .csv files. (I use the manual delimiter option, because when I use the auto-delimiter '~' it comes up with random one-letter columns.)
R: using the newly generated .csv files, I create the table structures in SQL Server based on their headers.
R: using the data collected from running the format script for each table, I have ~20 .txt files. I load them into RStudio and then attempt to insert new data using the following script:
Initialization:
R script which loads the table structure of one .csv:
library(RODBC)
#set table name
tname <- "tablename"
#connect to MSSQL
conn <- odbcDriverConnect("driver={SQL Server};
Server=serv; Database=dbname;
Uid=username; Pwd=pwd;trusted_connection=yes")
#get df headers from .csv
updatedtables <- read.csv(file = paste("<PATH>", tname, ".csv", sep=""), sep = ",")
#save to MSSQL
save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = F, rownames = F, verbose = T, safer = T, fast = F)
#add update record in log file
write(paste(Sys.time(), ": Updated dbo.",tname, sep=""), file = "<PATH>", append = TRUE)
odbcClose(conn)
return(0)
I run this code for each .csv to initialize the table structures
Dynamic things:
R script which attempts to load data from the .txt files into the existing table structures:
library(RODBC)

## ~~~~~~~~~~~~~~ Oracle to MSSQL Automation Script ~~~~~~~~~~
# A .bat script first runs which updates the .txt dumps containing the Oracle database
# info, using the configuration and format .sql queries for each table.

# function that writes to a table tname using data from tname.txt
writeTable <- function(tname){
  # connect to MSSQL
  conn <- odbcDriverConnect("driver={SQL Server};
                            Server=serv; Database=db;
                            Uid=uid; Pwd=pw;trusted_connection=yes")
  # remove all data entries for the table
  res <- sqlQuery(conn, paste("TRUNCATE TABLE dbo.", tname, sep = ""))
  # load updated data from the .txt files
  # skip = 3 because of blank lines and the headers
  updatedtables <- read.csv(file = paste("<PATH>", tname, ".txt", sep = ""), skip = 3, sep = "~")
  # save to MSSQL -- this is the line that produces the ERROR described below
  save <- sqlSave(conn, updatedtables, tablename = paste("dbo.", tname, sep = ""),
                  append = T, rownames = F, verbose = F, safer = T, fast = F)
  # add update record in log file
  write(paste(Sys.time(), ": Updated dbo.", tname, sep = ""), file = "<PATH>", append = TRUE)
  # close connection
  odbcClose(conn)
  return(0)
}

# gets all file names and dynamically inputs the data for every file in the path directory
update <- function(){
  # insert line separator for log organization which represents a new script run
  write("-UPDATE--------------------", file = "<PATH>", append = T)
  path <- "<PATH>"
  # save all .txt file names in path to a vector
  file.names <- dir(path, pattern = ".txt")
  # go through the path's files
  for (i in 1:length(file.names)){
    # get the proper table name from file.names[i]
    temp <- gsub(".txt", "", file.names[i])
    # call helper function
    writeTable(temp)
  }
}

# run
update()
This second script contains the error. Everything works and loads, except the line save <- sqlSave(conn, updatedtables, tablename = paste("dbo.", tname, sep = ""), append = T, rownames = F, verbose = F, safer = T, fast = F), which returns the error:
Error in `colnames<-`(`*tmp*`, value = c("QUOTENUMBER", "ENTITYPRODUCTITEMID", :
length of 'dimnames' [2] not equal to array extent
Any suggestions on what to do?
This is an example of the way the text dumps look (without spaces and with dummy data)
http://pastebin.com/kF8yR1R0

dbHasCompleted always returns TRUE

I'm using R to do a statistical analysis on a SQL Server 2008 R2 database. My database client is JDBC, so I'm using the RJDBC package.
My query is pretty simple and I'm sure it would return a lot of rows (about 2 million).
SELECT * FROM [maindb].[dbo].[users]
My R script is as follows.
library(RJDBC);
javaPackageName <- "com.microsoft.sqlserver.jdbc.SQLServerDriver";
clientJarFile <- "/home/abforce/mystuff/sqljdbc_3.0/enu/sqljdbc4.jar";
driver <- JDBC(javaPackageName, clientJarFile);
conn <- dbConnect(driver, "jdbc:sqlserver://192.168.56.101", "username", "password");
query <- "SELECT * FROM [maindb].[dbo].[users]";
result <- dbSendQuery(conn, query);
dbHasCompleted(result)
In the code above, the last line always returns TRUE. What could be wrong here?
dbHasCompleted always returning TRUE seems to be a known issue; I've found other places on the Internet where people were struggling with it.
So I came up with a workaround: instead of dbHasCompleted, check whether the last fetch returned zero rows, i.e. nrow(chunk) == 0.
For example:
result <- dbSendQuery(conn, query)
repeat {
  chunk <- dbFetch(result, n = 10)
  if (nrow(chunk) == 0) {
    break
  }
  # Do something with 'chunk'
}
dbClearResult(result)

How to pass data.frame for UPDATE with R DBI

With RODBC, there were functions like sqlUpdate(channel, dat, ...) that allowed you to pass dat = data.frame(...) instead of having to construct your own SQL string.
However, with R's DBI, all I see are functions like dbSendQuery(conn, statement, ...) which only take a string statement and give no opportunity to specify a data.frame directly.
So how to UPDATE using a data.frame with DBI?
A really late answer, but maybe still helpful...
There is no single function for this (that I know of) in the DBI/odbc packages, but you can replicate the update behaviour using a prepared update statement (which should work faster than RODBC's sqlUpdate, since it sends the parameter values as a batch to the SQL Server):
library(DBI)
library(odbc)
con <- dbConnect(odbc::odbc(), driver = "{SQL Server Native Client 11.0}",
                 server = "dbserver.domain.com\\default,1234",
                 Trusted_Connection = "yes", database = "test")  # assumes Microsoft SQL Server
dbWriteTable(con, "iris", iris, row.names = TRUE)  # create and populate a table (adding the row names as a separate column used as row ID)
update <- dbSendQuery(con, 'update iris set "Sepal.Length"=?, "Sepal.Width"=?, "Petal.Length"=?, "Petal.Width"=?, "Species"=? WHERE row_names=?')
# create a modified version of `iris`
iris2 <- iris
iris2$Sepal.Length <- 5
iris2$Petal.Width[2] <- 1
iris2$row_names <- rownames(iris) # use the row names as unique row ID
dbBind(update, iris2) # send the updated data
dbClearResult(update) # release the prepared statement
# now read the modified data - you will see the updates did work
data1 <- dbReadTable(con, "iris")
dbDisconnect(con)
This works only if you have a primary key, which I created in the above example by using the row names (a unique number, increasing by one for each row).
For more information about the odbc package I have used in the DBI dbConnect statement see: https://github.com/rstats-db/odbc
Building on R Yoda's answer, I made myself the helper function below. This allows using a dataframe to specify update conditions.
While I built this to run transaction updates (i.e. single rows), it can in theory update multiple rows by passing a condition. However, that's not the same as updating multiple rows using an input dataframe. Maybe somebody else can build on this...
dbUpdateCustom <- function(x, key_cols, con, schema_name, table_name) {

  if (nrow(x) != 1) stop("Input dataframe must be exactly 1 row")
  if (!all(key_cols %in% colnames(x))) stop("All columns specified in 'key_cols' must be present in 'x'")

  # Build the update string --------------------------------------------------
  df_key <- dplyr::select(x, one_of(key_cols))
  df_upt <- dplyr::select(x, -one_of(key_cols))

  set_str <- purrr::map_chr(colnames(df_upt), ~glue::glue_sql('{`.x`} = {x[[.x]]}', .con = con))
  set_str <- paste(set_str, collapse = ", ")

  where_str <- purrr::map_chr(colnames(df_key), ~glue::glue_sql("{`.x`} = {x[[.x]]}", .con = con))
  where_str <- paste(where_str, collapse = " AND ")

  update_str <- glue::glue('UPDATE {schema_name}.{table_name} SET {set_str} WHERE {where_str}')

  # Execute ------------------------------------------------------------------
  query_res <- DBI::dbSendQuery(con, update_str)
  DBI::dbClearResult(query_res)

  return(invisible(TRUE))
}
Where
x: 1-row dataframe that contains 1+ key columns, and 1+ update columns.
key_cols: character vector, of 1 or more column names that are the keys (i.e. used in the WHERE clause)
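A hypothetical call would then look like this (the table, schema and column names below are made up):
new_row <- data.frame(id = 42L, status = "shipped", stringsAsFactors = FALSE)
dbUpdateCustom(new_row, key_cols = "id", con = con,
               schema_name = "dbo", table_name = "orders")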
Here is a little helper function I put together using REPLACE INTO to update a table using DBI, replacing old duplicate entries with the new values. It's basic and for my own needs, but should be easy to modify. All you need to pass to the function is the connection, table name, and dataframe. Note that the table must have a PRIMARY KEY column. I've also included a simple working example.
row_to_list <- function(Y) suppressWarnings(split(Y, f = row(Y)))

sql_val <- function(y){
  if (!is.numeric(y)) {
    return(paste0("'", y, "'"))
  } else {
    if (is.na(y)) {
      return("NULL")
    } else {
      return(as.character(y))
    }
  }
}

to_sql_row <- function(x) paste0("(", paste(do.call("c", lapply(x, FUN = sql_val)), collapse = ", "), ")")

bracket <- function(x) paste0("`", x, "`")

to_sql_string <- function(x) paste0("(", paste(sapply(x, FUN = bracket), collapse = ", "), ")")

replace_into_table <- function(con, table_name, new_data){
  #new_data <- data.table(new_data)
  cols  <- to_sql_string(names(new_data))
  vals  <- paste(lapply(row_to_list(new_data), FUN = to_sql_row), collapse = ", ")
  query <- paste("REPLACE INTO", table_name, cols, "VALUES", vals)
  rs    <- dbExecute(con, query)
  return(rs)
}
tb <- data.frame("id" = letters[1:20], "A" = 1:20, "B" = seq(.1,2,.1)) # sample data
dbWriteTable(con, "test_table", tb) # create table
dbExecute(con, "ALTER TABLE test_table ADD PRIMARY KEY (id)") # set primary key
new_data <- data.frame("id" = letters[19:23], "A" = 1:5, "B" = seq(101,105)) # new data
new_data[4,2] <- NA # add some NA values
new_data[5,3] <- NA
table_name <- "test_table"
replace_into_table(con, "test_table", new_data)
result <- dbReadTable(con, "test_table")
