Oracle to SQL Server large-scale Migration - sql-server

I am trying to migrate a very large number of rows and columns dynamically from Oracle to SQL Server, and to automate this for about 20 tables daily.
My process is as follows:
Run Oracle configuration script: runs a query which returns the Oracle table data, but the data has duplicates and bad data.
Oracle format script: calls the configuration script and then cleans the data (deletes duplicates and bad data). Returns the data to a .txt file with '~' separators.
.txt to .csv (this is my workaround/hack): using Excel, I open all of the .txt files and convert them into properly formatted .csv files. I use the manual delimiter option, because the automatic delimiter option with '~' produces random one-letter columns.
R: using the newly generated .csv files, I create table structures in SQL Server based on their headers.
R: running the format script for each table gives me ~20 .txt files. I load them into RStudio and then attempt to insert the new data using the following script:
Initialization:
RScript which loads the table structure of one .CSV
library(RODBC)
#set table name
tname <- "tablename"
#connect to MSSQL
conn <- odbcDriverConnect("driver={SQL Server};
Server=serv; Database=dbname;
Uid=username; Pwd=pwd;trusted_connection=yes")
#get df headers from .csv
updatedtables <- read.csv(file = paste("<PATH>", tname, ".csv", sep=""), sep = ",")
#save to MSSQL
save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = F, rownames = F, verbose = T, safer = T, fast = F)
#add update record in log file
write(paste(Sys.time(), ": Updated dbo.",tname, sep=""), file = "<PATH>", append = TRUE)
odbcClose(conn)
return(0)
I run this code for each .csv to initialize the table structures
Dynamic things:
RScript which attempts to load data from .txt to existing table structures:
library(RODBC)
##~~~~~~~~~~~~~~ Oracle to MSSQL Automation Script ~~~~~~~~~~
#.bat script first runs which updates .txt dumps containing Oracle database info using configuration and format .sql queries for each table.
#function that writes to a table tname using data from tname.txt
writeTable <- function (tname){
  #connect to MSSQL
  conn <- odbcDriverConnect("driver={SQL Server};Server=serv;Database=db;Uid=uid;Pwd=pw;trusted_connection=yes")
  #remove all data entries for table
  res <- sqlQuery(conn, paste("TRUNCATE TABLE dbo.", tname, sep = ""))
  #load updated data from .txt files
  #this skips three because of blank lines and skipping the headers
  updatedtables <- read.csv(file = paste("<PATH>", tname, ".txt", sep=""),skip=3, sep = "~")
  #**ERROR** is raised by the sqlSave call below
  #save to MSSQL
  save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = T, rownames = F, verbose = F, safer = T, fast = F)
  #add update record in log file
  write(paste(Sys.time(), ": Updated dbo.",tname, sep=""), file = "<PATH>", append = TRUE)
  #close connection
  odbcClose(conn)
  return(0)
}
#gets all file names and dynamically inputs the data for every file in the path directory
update <- function(){
  #insert line separator for log organization which represents new script run
  write("-UPDATE--------------------", file = "<PATH>", append = T)
  path = "<PATH>"
  #save all .txt file names in path to vector (pattern is a regular expression)
  file.names <- dir(path, pattern = "\\.txt$")
  #go through path's files
  for(i in seq_along(file.names)){
    #get proper table name of file.names[i]
    temp <- gsub("\\.txt$", "", file.names[i])
    #call helper func
    writeTable(temp)
  }
}
#run
update()
This file contains the error. Everything works and loads, except the line save <- sqlSave(conn, updatedtables, tablename = paste("dbo.",tname, sep = ""), append = T, rownames = F, verbose = F, safer = T, fast = F), which returns the error
Error in `colnames<-`(`*tmp*`, value = c("QUOTENUMBER", "ENTITYPRODUCTITEMID", :
length of 'dimnames' [2] not equal to array extent
Any suggestions on what to do?
This is an example of the way the text dumps look (without spaces and with dummy data)
http://pastebin.com/kF8yR1R0
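A possible replacement for the manual Excel conversion step, sketched in Python rather than R. It is only an illustration under assumptions: the dump is '~'-separated, may contain blank lines, and its first non-blank line is the header. The function name `tilde_to_csv` and the sample data are hypothetical. It writes a regular CSV and reports rows whose field count differs from the header, which may be worth ruling out before loading the data.

```python
import csv
import io


def tilde_to_csv(src, dst, sep="~"):
    """Copy '~'-separated text from src to dst as CSV.

    src and dst are open text file objects. Returns (header, bad_rows),
    where bad_rows lists (line_number, field_count) for rows whose width
    does not match the header.
    """
    writer = csv.writer(dst)
    header = None
    bad_rows = []
    for line_number, line in enumerate(src, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines in the dump
        fields = line.split(sep)
        if header is None:
            header = fields  # first non-blank line is assumed to be the header
        elif len(fields) != len(header):
            bad_rows.append((line_number, len(fields)))
            continue  # leave out rows of the wrong width
        writer.writerow(fields)
    return header, bad_rows


# Hypothetical sample dump (not the real data).
sample = "\n\nQUOTENUMBER~ENTITYPRODUCTITEMID\n1~10\n2~20~extra\n3~30\n"
out = io.StringIO()
header, bad_rows = tilde_to_csv(io.StringIO(sample), out)
print(header)    # ['QUOTENUMBER', 'ENTITYPRODUCTITEMID']
print(bad_rows)  # [(5, 3)]
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `open(..., newline="")`.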

Related

KeyError while trying to connect to database using pymssql

The below code tries to connect to a mssql database using pymssql. I have a CSV file and I am trying to push all the rows into a single data table in the mssql database. I am getting a 'KeyError' when I try to execute the code after opening the CSV file.
import csv
import pymssql
conn = pymssql.connect(host="host name",
                       database="dbname",
                       user = "username",
                       password = "password")
cursor = conn.cursor()
if(conn):
    print("True")
else:
    print("False")
with open ('path to csv file', 'r') as f:
    reader = csv.reader(f)
    columns = next(reader)
    query = "INSERT INTO Marketing({'URL', 'Domain_name', 'Downloadables', 'Text_without_javascript', 'Downloadable_Link'}) VALUES ({%s,%s,%s,%s,%s})"
    query = query.format(','.join('[' + x + ']' for x in columns), ','.join('?' * len(columns)))
    cursor = conn.cursor()
    for data in reader:
        cursor.execute(query, tuple(data))
        cursor.commit()
The below is the error that I get:
KeyError: "'URL', 'Domain_name', 'Downloadables', 'Text_without_javascript', 'Downloadable_Link'"
Using to_sql
file_path = "path to csv"
engine = create_engine("mssql://user:password#host/database")
df = pd.read_csv(file_path, encoding = 'latin')
df.to_sql(name='Marketing',con=engine,if_exists='append')
Output:
InterfaceError: (pyodbc.InterfaceError) ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')
I tried everything, from converting the parameters to a tuple to passing them as is, but nothing helped. Below is the code that fixed the issue:
with open ('path to csv file', 'r') as f:
    reader = csv.reader(f)
    columns = next(reader)  # consume the header row
    cursor = conn.cursor()
    # plain string without braces, so no str.format() call is needed
    query = ("INSERT INTO Marketing(URL, Domain_name, Downloadables, Text_without_javascript, Downloadable_Link) VALUES (%s,%s,%s,%s,%s)")
    for data in reader:
        parameters = tuple(data)
        cursor.execute(query, parameters)
    conn.commit()
Note: the database connection part remains the same as in the question.
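The KeyError from the question can be reproduced without a database. In `str.format`, everything inside `{...}` is a replacement field, so the braces around the column list are parsed as a field name that was never supplied. A minimal sketch, with the table and column names shortened from the question and no pymssql connection involved:

```python
# The braces make str.format() look for a field named "'URL', 'Domain_name'".
template = ("INSERT INTO Marketing({'URL', 'Domain_name'}) "
            "VALUES ({%s,%s})")

try:
    template.format("[URL],[Domain_name]", "?,?")
    error = None
except KeyError as exc:
    error = exc
print(repr(error))  # KeyError("'URL', 'Domain_name'")

# Building the statement with empty {} fields works; the row values are
# then passed separately to cursor.execute() as parameters.
columns = ["URL", "Domain_name"]
query = "INSERT INTO Marketing({}) VALUES ({})".format(
    ", ".join(columns), ", ".join(["%s"] * len(columns)))
print(query)  # INSERT INTO Marketing(URL, Domain_name) VALUES (%s, %s)
```

The `%s` markers are left untouched by `str.format`, so they survive as the parameter placeholders used in the fixed code above.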

Csv file to a Lua table and access the lines as new table or function()

Currently my code has simple tables containing the data needed for each object, like this:
infantry = {class = "army", type = "human", power = 2}
cavalry = {class = "panzer", type = "motorized", power = 12}
battleship = {class = "navy", type = "motorized", power = 256}
I use the table names as identifiers in various functions, which are simply called to access the values and process them one by one.
Now I want to store this data in a spreadsheet (csv file) instead, which looks something like this:
Name class type power
Infantry army human 2
Cavalry panzer motorized 12
Battleship navy motorized 256
The spreadsheet will not have more than 50 lines, and I want to be able to add columns in the future.
I tried a couple of approaches from similar situations I found here, but due to lacking skills I failed to access any values from the nested table. I think this is because I don't fully understand how the table is structured after each line of the csv file has been read into it, and so I fail to print any values at all.
If there is a way to get the name, class, type and power from the table and use that line just like my old simple tables, I would appreciate an educational example. Another approach could be to declare new tables from the csv, line by line, that behave exactly like my old simple tables. I don't know if this is doable.
Using Lua 5.1
You can read the csv file in as a string. I will use a multi-line string here to represent the csv.
gmatch with pattern [^\n]+ will return each row of the csv.
gmatch with pattern [^,]+ will return the value of each column from our given row.
If more rows or columns are added, or if the columns are moved around, we will still reliably convert the information as long as the first row has the header information.
The only column that cannot move is the first one, the Name column; if it is moved, it will change the key used to store the row in the table.
Using gmatch and 2 patterns, [^,]+ and [^\n]+, you can separate the string into each row and column of the csv. Comments in the following code:
local csv = [[
Name,class,type,power
Infantry,army,human,2
Cavalry,panzer,motorized,12
Battleship,navy,motorized,256
]]
local items = {} -- Store our values here
local headers = {} -- Column names taken from the first line
local first = true
for line in csv:gmatch("[^\n]+") do
  if first then -- this is to handle the first line and capture our headers.
    local count = 1
    for header in line:gmatch("[^,]+") do
      headers[count] = header
      count = count + 1
    end
    first = false -- set first to false to switch off the header block
  else
    local name
    local i = 2 -- We start at 2 because the first field (the name) does not increment i
    for field in line:gmatch("[^,]+") do
      name = name or field -- check if we know the name of our row
      if items[name] then -- if the name is already in the items table then this is a field
        items[name][headers[i]] = field -- assign our value at the header in the table with the given name.
        i = i + 1
      else -- if the name is not in the table we create a new index for it
        items[name] = {}
      end
    end
  end
end
Here is how you can load a csv using the I/O library:
-- Example of how to load the csv.
path = "some\\path\\to\\file.csv"
local f = assert(io.open(path))
local csv = f:read("*all")
f:close()
Alternatively, you can use io.lines(path), which would take the place of csv:gmatch("[^\n]+") in the for loop as well.
Here is an example of using the resulting table:
-- print table out
print("items = {")
for name, item in pairs(items) do
  print(" " .. name .. " = { ")
  for field, value in pairs(item) do
    print(" " .. field .. " = ".. value .. ",")
  end
  print(" },")
end
print("}")
The output:
items = {
Infantry = {
type = human,
class = army,
power = 2,
},
Battleship = {
type = motorized,
class = navy,
power = 256,
},
Cavalry = {
type = motorized,
class = panzer,
power = 12,
},
}

Writing Unicode from R to SQL Server

I'm trying to write Unicode strings from R to SQL, and then use that SQL table to power a Power BI dashboard. Unfortunately, the Unicode characters only seem to work when I load the table back into R, and not when I view the table in SSMS or Power BI.
require(odbc)
require(DBI)
require(dplyr)
con <- DBI::dbConnect(odbc::odbc(),
.connection_string = "DRIVER={ODBC Driver 13 for SQL Server};SERVER=R9-0KY02L01\\SQLEXPRESS;Database=Test;trusted_connection=yes;")
testData <- data_frame(Characters = "❤")
dbWriteTable(con,"TestUnicode",testData,overwrite=TRUE)
result <- dbReadTable(con, "TestUnicode")
result$Characters
Successfully yields:
> result$Characters
[1] "❤"
However, when I pull that table in SSMS:
SELECT * FROM TestUnicode
I get two different characters:
Characters
~~~~~~~~~~
â¤
Those characters are also what appear in Power BI. How do I correctly pull the heart character outside of R?
It turns out this is a bug somewhere in R/DBI/the ODBC driver. The issue is that R stores strings as UTF-8 encoded, while SQL Server stores them as UTF-16LE encoded. Also, when dbWriteTable creates a table, it by default creates a VARCHAR column for strings which can't even hold Unicode characters. Thus, you need to both:
Change the column in the R data frame from being a string column to a list column of UTF-16LE raw bytes.
When using dbWriteTable, specify the field type as being NVARCHAR(MAX)
This seems like something that should still be handled by either DBI or ODBC or something though.
require(odbc)
require(DBI)
# This function takes a string vector and turns it into a list of raw UTF-16LE bytes.
# These will be needed to load into SQL Server
convertToUTF16 <- function(s){
  lapply(s, function(x) unlist(iconv(x,from="UTF-8",to="UTF-16LE",toRaw=TRUE)))
}
# create a connection to a sql table
connectionString <- "[YOUR CONNECTION STRING]"
con <- DBI::dbConnect(odbc::odbc(),
.connection_string = connectionString)
# our example data
testData <- data.frame(ID = c(1,2,3), Char = c("I", "❤","Apples"), stringsAsFactors=FALSE)
# we adjust the column with the UTF-8 strings to instead be a list column of UTF-16LE bytes
testData$Char <- convertToUTF16(testData$Char)
# write the table to the database, specifying the field type
dbWriteTable(con,
             "UnicodeExample",
             testData,
             append=TRUE,
             field.types = c(Char = "NVARCHAR(MAX)"))
dbDisconnect(con)
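The two characters seen in SSMS are consistent with the encoding mismatch described in this answer. A small Python check, used here only because it makes the bytes easy to inspect (it does not touch the database), shows the UTF-8 bytes of the heart being misread one byte per character, and the UTF-16LE bytes that the conversion helper produces instead. Latin-1 is an assumption standing in for whatever single-byte code page the VARCHAR column actually uses.

```python
heart = "\u2764"  # the heart character from the question

utf8_bytes = heart.encode("utf-8")
utf16le_bytes = heart.encode("utf-16-le")
print(utf8_bytes.hex(" "))     # e2 9d a4
print(utf16le_bytes.hex(" "))  # 64 27

# Reading the three UTF-8 bytes as single-byte Latin-1 characters gives
# 'â', an invisible control character, and '¤'.
misread = utf8_bytes.decode("latin-1")
print([hex(ord(ch)) for ch in misread])  # ['0xe2', '0x9d', '0xa4']
```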
Inspired by last answer and github: r-dbi/DBI#215: Storing unicode characters in SQL Server
This follows field.types = c(Char = "NVARCHAR(MAX)"), but builds the field types as a vector and computes the maximum length per column, because NVARCHAR(MAX) caused the error "dbReadTable/dbGetQuery returns Invalid Descriptor Index ....":
vector_nvarchar <- c(Filter(Negate(is.null),
  lapply(testData, function(x) {
    if (is.character(x)) c(
      names(x),
      paste0("NVARCHAR(",
        max(
          # nvarchar(max) gave error dbReadTable/dbGetQuery returns Invalid Descriptor Index error on SQL server
          # https://github.com/r-dbi/odbc/issues/112
          # so we compute the max
          nchar(
            iconv( # nchar doesn't work for UTF-8 : help(nchar)
              Filter(Negate(is.null), x),
              "UTF-8", "ASCII", sub = "x"
            )
          ),
          na.rm = TRUE),
        ")"
      )
    )
  })
))
con= DBI::dbConnect(odbc::odbc(),.connection_string=xxxxt, encoding = 'UTF-8')
DBI::dbWriteTable(con,"UnicodeExample",testData, overwrite= TRUE, append=FALSE, field.types= vector_nvarchar)
DBI::dbGetQuery(con,iconv('select * from UnicodeExample'))
Inspired by the last answer I also tried to find an automated way for writing data frames to SQL server. I can not confirm the nvarchar(max) errors, so I ended up with these functions:
library(rlist) # provides list.cbind()
convertToUTF16_df <- function(df){
  output <- cbind(df[sapply(df, typeof) != "character"]
                  , list.cbind(apply(df[sapply(df, typeof) == "character"], 2, function(x){
                    return(lapply(x, function(y) unlist(iconv(y, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))))
                  }))
  )[colnames(df)]
  return(output)
}
field_types <- function(df){
  output <- list()
  output[colnames(df)[sapply(df, typeof) == "character"]] <- "nvarchar(max)"
  return(output)
}
DBI::dbWriteTable(odbc_connect
                  , name = SQL("database.schema.table")
                  , value = convertToUTF16_df(df)
                  , overwrite = TRUE
                  , row.names = FALSE
                  , field.types = field_types(df)
)
I found the previous answer very useful but ran into problems with character vectors that had another encoding such as 'latin1' instead of UTF-8. This resulted in random NULLs in the database column due to special characters such as non-breaking spaces.
In order to avoid these encoding issues, I've made the following modifications to detect the character vector encoding or otherwise default back to UTF-8 before conversion to UTF-16LE:
library(rlist)
convertToUTF16_df <- function(df){
  output <- cbind(df[sapply(df, typeof) != "character"]
                  , list.cbind(apply(df[sapply(df, typeof) == "character"], 2, function(x){
                    return(lapply(x, function(y) {
                      if (Encoding(y)=="unknown") {
                        unlist(iconv(enc2utf8(y), from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))
                      } else {
                        unlist(iconv(y, from = Encoding(y), to = "UTF-16LE", toRaw = TRUE))
                      }
                    }))
                  }))
  )[colnames(df)]
  return(output)
}
field_types <- function(df){
output <- list()
output[colnames(df)[sapply(df, typeof) == "character"]] <- "nvarchar(max)"
return(output)
}
DBI::dbWriteTable(odbc_connect
, name = SQL("database.schema.table")
, value = convertToUTF16_df(df)
, overwrite = TRUE
, row.names = FALSE
, field.types = field_types(df)
)
Ideally, I'd still modify this to remove the rlist dependency but it seems to work now.
You could consider using the package RODBC instead of odbc/DBI. I have used RODBC with SQL Server and with Microsoft Access as permanent data storage systems. I never had trouble with German umlauts (e.g. Ä, ä, ..., ß).
I wonder if using iconv is an appealing alternative, as there seem to be some '\X00' issues (e.g. https://www.r-bloggers.com/2010/06/more-powerful-iconv-in-r/)
I am posting this answer as an extension to the top answer, because some people might find it useful.
If you need Unicode strings in SQL statements such as INSERT or UPDATE where you cannot use dbWriteTable(), you can construct your query with dbBind() like this:
x <- "äöü"
x <- iconv(x, from="UTF-8", to="UTF-16LE", toRaw = TRUE)
q <- "
update foobar
set umlauts = ?
where id = 1
"
query <- DBI::dbSendStatement(con, q)
DBI::dbBind(query, list(x))
DBI::dbClearResult(query)

SQL loop with python and pyodbc

I want to create a query loop that iterates step by step from one instance to the next. After fetching the right data, I will do some calculations and write the new information back to the database with specific relations.
Example setup
First loop = Aggregation Level (1, 2, 3)
Second loop = Product(A, B, C)
Time = (2001,2002,2003)
1. First loop
Aggregation level "1", Product "A", Fetch all years (2001, 2002, 2003)
Output = [
(2001,'10 pc','20 €')
(2002,'8 pc','18 €')
(2003,'82 pc','5000 €')]
2. Second loop
Aggregation level "1", Product "B", Fetch all years (2001, 2002, 2003)
Output = [
(2001,'15 pc','35 €')
(2002,'20 pc','100 €')
(2003,'25 pc','5000 €')]
I will use the array (rows) to do some calculations. For that I will transform the data into a specific structure, after which I will get new figures.
My code is posted at the end. Instead of the variables [X, Y] in the SELECT statement, I want to use the index of the imported Excel files as the loop parameter; or is it better to search inside the database?
The information is already stored in the database.
The variables [X1, Y1] are my new values, which must be written back to the database with a specific relation. How should I do that to get the right relationship? With a primary key, or are there better ways?
Wishes
123GuteLaune
import pyodbc as py
import openpyxl as ol
#Connection string to access
cnxn = py.connect("DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};UID=admin;UserCommitSync=Yes#;Threads=3;SafeTransactions=0;PageTimeout=5;MaxScanRows=8;MaxBufferSize=2048;FIL={MS Access};DriverId=25;DefaultDir=C:\xxx\Desktop;DBQ=C:\xxx\Desktop\DatabaseMA_Test.mdb")
#Connection string to excel
wb = ol.load_workbook('C:\xxx\Desktop\Beisspieldatensatz.xlsx')
X = wb.get_sheet_by_name("Hierarchie")
tuple(X['A2':'A72']).value
wb = ol.load_workbook('C:\xxx\Desktop\Beisspieldatensatz.xlsx')
Y = wb.get_sheet_by_name("Mat_GR")
tuple(Y['A2':'A51']).value
#Output of the loaded Excel data
print (X,Y)
#Load Data out of MS Access
cursor.execute(
"""
SELECT Mat_ID, Monat, Jahr, Umsatzmenge, Umsatz
FROM [TW-DS]
Order BY Jahr ASC 'this is at the moment not working'
WHERE Hier_ID = ?
and Mat_ID= ?
""", [X, Y])
rows = cursor.fetchall()
print(rows)
#Mathematical model unfilled
#Write data back in MS Access
cursor.execute(
"""insert into [TW-DS]
(Outlier,
[Outlier Value])
Values
(X1,
Y2)
""")
cnxn.commit()
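One way to structure the loops described above, sketched with Python's built-in sqlite3 as a stand-in for the Access database so that it runs without a driver; with pyodbc the cursor calls and the `?` placeholders have the same shape. The table name `TW_DS`, the ID lists, the sample rows (taken from the example output in the question) and the placeholder calculation are all hypothetical. Note that WHERE comes before ORDER BY, and that the new value is written back with a parameterized UPDATE keyed on the same columns used for the selection.

```python
import sqlite3

cnxn = sqlite3.connect(":memory:")  # stand-in for the Access connection
cursor = cnxn.cursor()
cursor.execute("""CREATE TABLE TW_DS (
    Hier_ID INTEGER, Mat_ID TEXT, Jahr INTEGER,
    Umsatzmenge REAL, Umsatz REAL, Outlier INTEGER)""")
cursor.executemany(
    "INSERT INTO TW_DS (Hier_ID, Mat_ID, Jahr, Umsatzmenge, Umsatz) "
    "VALUES (?, ?, ?, ?, ?)",
    [(1, "A", 2002, 8, 18), (1, "A", 2001, 10, 20), (1, "A", 2003, 82, 5000),
     (1, "B", 2001, 15, 35), (1, "B", 2002, 20, 100), (1, "B", 2003, 25, 5000)],
)

hier_ids = [1]        # would come from the "Hierarchie" sheet
mat_ids = ["A", "B"]  # would come from the "Mat_GR" sheet

for hier_id in hier_ids:
    for mat_id in mat_ids:
        cursor.execute(
            """SELECT Jahr, Umsatzmenge, Umsatz
               FROM TW_DS
               WHERE Hier_ID = ? AND Mat_ID = ?
               ORDER BY Jahr ASC""",
            (hier_id, mat_id),
        )
        rows = cursor.fetchall()
        # Placeholder calculation: flag the year with the largest Umsatz.
        top_year = max(rows, key=lambda row: row[2])[0]
        cursor.execute(
            """UPDATE TW_DS SET Outlier = 1
               WHERE Hier_ID = ? AND Mat_ID = ? AND Jahr = ?""",
            (hier_id, mat_id, top_year),
        )
cnxn.commit()

flagged = cursor.execute(
    "SELECT Mat_ID, Jahr FROM TW_DS WHERE Outlier = 1 ORDER BY Mat_ID"
).fetchall()
print(flagged)  # [('A', 2003), ('B', 2003)]
```

Keying the write-back on (Hier_ID, Mat_ID, Jahr) assumes that combination identifies one row; a dedicated primary key column would serve the same purpose.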

How to pass data.frame for UPDATE with R DBI

With RODBC, there were functions like sqlUpdate(channel, dat, ...) that allowed you to pass dat = data.frame(...) instead of having to construct your own SQL string.
However, with R's DBI, all I see are functions like dbSendQuery(conn, statement, ...), which only take a string statement and give no opportunity to specify a data.frame directly.
So how to UPDATE using a data.frame with DBI?
Really late, my answer, but maybe still helpful...
There is no single function (that I know of) in the DBI/odbc packages, but you can replicate the update behavior using a prepared update statement (which should work faster than RODBC's sqlUpdate, since it sends the parameter values as a batch to the SQL server):
library(DBI)
library(odbc)
con <- dbConnect(odbc::odbc(), driver="{SQL Server Native Client 11.0}", server="dbserver.domain.com\\default,1234", Trusted_Connection = "yes", database = "test") # assumes Microsoft SQL Server
dbWriteTable(con, "iris", iris, row.names = TRUE) # create and populate a table (adding the row names as a separate columns used as row ID)
update <- dbSendQuery(con, 'update iris set "Sepal.Length"=?, "Sepal.Width"=?, "Petal.Length"=?, "Petal.Width"=?, "Species"=? WHERE row_names=?')
# create a modified version of `iris`
iris2 <- iris
iris2$Sepal.Length <- 5
iris2$Petal.Width[2] <- 1
iris2$row_names <- rownames(iris) # use the row names as unique row ID
dbBind(update, iris2) # send the updated data
dbClearResult(update) # release the prepared statement
# now read the modified data - you will see the updates did work
data1 <- dbReadTable(con, "iris")
dbDisconnect(con)
This works only if you have a primary key which I created in the above example by using the row names which are a unique number increased by one for each row...
For more information about the odbc package I have used in the DBI dbConnect statement see: https://github.com/rstats-db/odbc
Building on R Yoda's answer, I made myself the helper function below. This allows using a dataframe to specify update conditions.
While I built this to run transaction updates (i.e. single rows), it can in theory update multiple rows passing a condition. However, that's not the same as updating multiple rows using an input dataframe. Maybe somebody else can build on this...
dbUpdateCustom = function(x, key_cols, con, schema_name, table_name) {
  if (nrow(x) != 1) stop("Input dataframe must be exactly 1 row")
  if (!all(key_cols %in% colnames(x))) stop("All columns specified in 'key_cols' must be present in 'x'")
  # Build the update string --------------------------------------------------
  df_key <- dplyr::select(x, dplyr::one_of(key_cols))
  df_upt <- dplyr::select(x, -dplyr::one_of(key_cols))
  set_str <- purrr::map_chr(colnames(df_upt), ~glue::glue_sql('{`.x`} = {x[[.x]]}', .con = con))
  set_str <- paste(set_str, collapse = ", ")
  where_str <- purrr::map_chr(colnames(df_key), ~glue::glue_sql("{`.x`} = {x[[.x]]}", .con = con))
  where_str <- paste(where_str, collapse = " AND ")
  update_str <- glue::glue('UPDATE {schema_name}.{table_name} SET {set_str} WHERE {where_str}')
  # Execute ------------------------------------------------------------------
  query_res <- DBI::dbSendQuery(con, update_str)
  DBI::dbClearResult(query_res)
  return (invisible(TRUE))
}
Where
x: 1-row dataframe that contains 1+ key columns, and 1+ update columns.
key_cols: character vector, of 1 or more column names that are the keys (i.e. used in the WHERE clause)
Here is a little helper function I put together using REPLACE INTO to update a table using DBI, replacing old duplicate entries with the new values. It's basic and for my own needs, but should be easy to modify. All you need to pass to the function is the connection, table name, and dataframe. Note that the table must have a PRIMARY KEY column. I've also included a simple working example.
row_to_list <- function(Y) suppressWarnings(split(Y, f = row(Y)))
sql_val <- function(y){
  if(!is.numeric(y)){
    return(paste0("'",y,"'"))
  }else{
    if(is.na(y)){
      return("NULL")
    }else{
      return(as.character(y))
    }
  }
}
to_sql_row <- function(x) paste0("(",paste(do.call("c", lapply(x, FUN = sql_val)), collapse = ", "),")")
bracket <- function(x) paste0("`",x,"`")
to_sql_string <- function(x) paste0("(",paste(sapply(x, FUN = bracket), collapse = ", "),")")
replace_into_table <- function(con, table_name, new_data){
  #new_data <- data.table(new_data)
  cols <- to_sql_string(names(new_data))
  vals <- paste(lapply(row_to_list(new_data), FUN = to_sql_row), collapse = ", ")
  query <- paste("REPLACE INTO", table_name, cols, "VALUES", vals)
  rs <- dbExecute(con, query)
  return(rs)
}
tb <- data.frame("id" = letters[1:20], "A" = 1:20, "B" = seq(.1,2,.1)) # sample data
dbWriteTable(con, "test_table", tb) # create table
dbExecute(con, "ALTER TABLE test_table ADD PRIMARY KEY (id)") # set primary key
new_data <- data.frame("id" = letters[19:23], "A" = 1:5, "B" = seq(101,105)) # new data
new_data[4,2] <- NA # add some NA values
new_data[5,3] <- NA
table_name <- "test_table"
replace_into_table(con, "test_table", new_data)
result <- dbReadTable(con, "test_table")
